Chapter 1


Introduction 


This chapter introduces some features of the ’C62xx microprocessor and 
discusses the basic process for creating code. 


1.1 TMS320C62xx Architecture 


The ’C62xx is a fixed-point digital signal processor (DSP) and is the first DSP
to use the VelociTI™ architecture. VelociTI is a high-performance, advanced
very long instruction word (VLIW) architecture, making it an excellent choice
for multichannel, multifunction, and performance-driven applications.


The ’C62xx DSPs are based on the ’C62xx CPU, which consists of:

- Program fetch unit
- Instruction dispatch unit
- Instruction decode unit
- Two data paths, each with four functional units
- 32 32-bit registers
- Control registers
- Control logic
- Test, emulation, and interrupt logic


1.2 TMS320C62xx Pipeline 


The ’C62xx pipeline has several features that provide optimum performance,
low cost, and simple programming:

- Increased pipelining eliminates traditional architectural bottlenecks in
  program fetch, data access, and multiply operations.
- Pipeline control is simplified by eliminating pipeline locks.
- The pipeline can dispatch eight parallel instructions every cycle.
- Parallel instructions proceed simultaneously through the same pipeline
  phases.


1.3 Code Development Flow 




You can achieve the best performance from your ’C62xx code if you follow this 
flow when you are writing and debugging your code: 


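[Flowchart: three-phase code development flow.
 Phase 1: Write C code. If the code is complete, stop; if not, go to Phase 2.
 Phase 2: Refine C code. If the code is complete, stop; if not and more C
          optimization is possible, refine again; otherwise go to Phase 3.
 Phase 3: Write serial assembly, run the assembly optimizer, and profile.
          Repeat until the code is complete.]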


The following table lists the phases in the three-step software development
flow shown in the flowchart above, and the goal for each phase:

Phase   Goal
1       Begin by developing C code and using the ’C6x compiler. This
        requires no knowledge of the ’C62xx; however, you can use
        the ’C6x profiling tools that are described in the TMS320C6x
        C Source Debugger User’s Guide to pinpoint any inefficient
        areas of your code.

2       Refine your C code using procedures such as compiler
        options, intrinsics, statements, data types, and code
        transformations. You can realize substantial performance
        gains in your C code by using these simple methods. Proceed
        to the next phase if you are still dissatisfied with the
        performance of your code.

3       Extract the inefficient areas from your C code and rewrite them
        in assembly optimizer source code.


Chapter 2


Optimizing C Code 


The goal in this chapter is to maximize C performance by using compiler 
options, intrinsics, and code transformations. This chapter discusses various 
ways to optimize your C code by providing specific examples and discussing 
topics such as: 


- The compiler and its options
- Intrinsics
- Software pipelining
- Loop unrolling


2.1 Analyzing C Code Performance 


The following techniques can help you to analyze the performance of specific 
code regions: 


- Use the clock() and printf() functions in C to time the performance of
  specific code regions. You can use the standalone loader (load6x) to run
  the code shown in Example 2-1.
- Use the profiler mode in the debugger as explained in the TMS320C6x
  C Source Debugger User’s Guide.
- Use breakpoints, the clk register, and the runb command in the debugger
  as described in the TMS320C6x C Source Debugger User’s Guide.


Most often the critical performance areas in your code are loops. The easiest 
way to optimize a loop is by extracting it into a separate file that can be trans- 
formed, recompiled, and run standalone. As you use the techniques described 
in this chapter to optimize your C code, you can then evaluate the results by 
running the code and looking at the instructions generated by the compiler. 


Example 2-1. Using the clock() Function 


    #include <stdio.h>
    #include <time.h>

    short a[100], b[100], c[100];
    void vecsum(short *a, short *b, short *c, int);

    int main(void)
    {
        clock_t overhead, start, stop;

        /**********************************************/
        /* COMPUTE THE OVERHEAD OF CALLING CLOCK      */
        /**********************************************/
        start    = clock();
        stop     = clock();
        overhead = stop - start;

        /**********************************************/
        /* CALL AND TIME THE VECSUM ROUTINE.          */
        /**********************************************/
        start = clock();
        vecsum(a, b, c, 100);
        stop  = clock();

        printf("vecsum cycles: %d\n", (int)(stop - start - overhead));
        return 0;
    }


2.2 Compiler Options 

Table 2-1 defines the options mentioned in this chapter.

- Although -o3 is preferable, at a minimum use the -o option.
- Use the -pm (program-level optimization) option for as much of your
  program as possible.

For a complete description of these and other options, see the TMS320C6x
Optimizing C Compiler User’s Guide.


Table 2-1. Subset of Compiler Options 


Option   Description
-o       Enables software pipelining and other optimizations in the
         compiler.
-pm      Enables program-level optimization.
-mt      Enables the compiler to make assumptions that allow it to
         be more aggressive with certain optimizations.
-ms      Ensures that redundant loops are not generated.
-k       Keeps the assembly file so that you can inspect it.


2.3 Tips on Data Types 


The ’C6x compiler defines different sizes for each data type: 


- short    16 bits
- int      32 bits
- long     40 bits

Based on the size of each data type, follow these guidelines:

- Avoid code that assumes that int and long types are the same size,
  because the ’C6x compiler uses 40-bit operations for long values.
- Use the short data type for multiplication inputs whenever possible,
  because this data type provides the most efficient use of the 16-bit
  multiplier in the ’C62xx.
- Use int or unsigned int types for loop counters.
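
The guidelines above can be sketched in portable C. The helper name mac16 is an assumption made for this illustration and does not come from the ’C6x tools. It keeps the multiplication inputs as short and uses an int loop counter. On a host compiler, long is typically 32 or 64 bits rather than 40, which is one more reason not to write code that depends on int and long having the same size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiply-accumulate over 16-bit inputs.
 * The inputs are short so that a 16-bit multiplier can be used
 * directly; the accumulator and the loop counter are plain int. */
int mac16(const short *x, const short *y, int n)
{
    int i;          /* int loop counter, as the guideline recommends */
    int acc = 0;

    assert(x != NULL && y != NULL && n >= 0);

    for (i = 0; i < n; i++)
        acc += x[i] * y[i];   /* short * short, widened to int */

    return acc;
}
```

Whether the compiler actually selects a single multiply instruction for each product depends on the options in use, so inspect the generated assembly (for example, with the -k option) rather than assuming it.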


2.4 Using Intrinsics 


One way to optimize your C code is by using intrinsics, which are special
functions that map directly to inlined ’C62xx instructions.

- Intrinsics are specified with a leading underscore and are accessed by
  calling them as you do a function.
- All instructions that are not easily expressed in C code are supported as
  intrinsics by the ’C6x compiler.

Traditionally, saturated addition can be expressed in C code only by writing a
multicycle function, such as the one in Example 2-2. Example 2-3 shows
code that uses the _sadd() intrinsic, which results in a single ’C62xx
instruction.

Table 2-2 lists the ’C62xx intrinsics. For more information, see the
TMS320C6x Optimizing C Compiler User’s Guide.


Example 2-2. Saturated Add Without Intrinsics 


    int sadd(int a, int b)
    {
        int result;

        result = a + b;
        if (((a ^ b) & 0x80000000) == 0)
        {
            if ((result ^ a) & 0x80000000)
            {
                result = (a < 0) ? 0x80000000 : 0x7fffffff;
            }
        }
        return (result);
    }


Example 2-3. Saturated Add With Intrinsics 


    result = _sadd(a, b);
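
When you test code that uses _sadd() on a host machine, where the intrinsic is not available, a portable stand-in can be useful. The following is a minimal sketch under that assumption. The name sadd_ref is hypothetical, and the function forms the sum in 64 bits instead of relying on 32-bit overflow behavior, so it differs in method from Example 2-2 while aiming for the same saturated result.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for _sadd(); sadd_ref is a hypothetical name.
 * The sum is formed in 64 bits so that it cannot overflow, and is
 * then clamped to the 32-bit signed range. */
int32_t sadd_ref(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;

    if (wide > INT32_MAX)
        wide = INT32_MAX;
    else if (wide < INT32_MIN)
        wide = INT32_MIN;

    assert(wide >= INT32_MIN && wide <= INT32_MAX);
    return (int32_t)wide;
}
```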


Table 2-2. TMS320C6x C Compiler Intrinsics

C Compiler Intrinsic                          Assembly    Description
                                              Instruction

int _add2(int src1, int src2);                ADD2        Adds the upper and lower halves of src1 to
                                                          the upper and lower halves of src2 and
                                                          returns the result. Any overflow from the
                                                          lower half add does not affect the upper
                                                          half add.

uint _clr(uint src2, uint csta, uint cstb);   CLR         Clears the specified field in src2. The
                                                          beginning and ending bits of the field to
                                                          be cleared are specified by csta and cstb,
                                                          respectively.

int _ext(uint src2, uint csta, int cstb);     EXT         Extracts the specified field in src2,
                                                          sign-extended to 32 bits. The extract is
                                                          performed by a shift left followed by a
                                                          signed shift right; csta and cstb are the
                                                          shift left and shift right amounts,
                                                          respectively.

uint _extu(uint src2, uint csta, uint cstb);  EXTU        Extracts the specified field in src2,
                                                          zero-extended to 32 bits. The extract is
                                                          performed by a shift left followed by an
                                                          unsigned shift right; csta and cstb are the
                                                          shift left and shift right amounts,
                                                          respectively.

uint _lmbd(uint src1, uint src2);             LMBD        Searches for the leftmost 1 or 0 of src2,
                                                          as determined by the LSB of src1. Returns
                                                          the number of bits up to the bit change.

int _mpy(int src1, int src2);                 MPY         Multiplies the 16 LSBs of src1 by the 16
int _mpyus(uint src1, int src2);              MPYUS       LSBs of src2 and returns the result. Values
int _mpysu(int src1, uint src2);              MPYSU       can be signed or unsigned.
uint _mpyu(uint src1, uint src2);             MPYU

int _mpyh(int src1, int src2);                MPYH        Multiplies the 16 MSBs of src1 by the 16
int _mpyhus(uint src1, int src2);             MPYHUS      MSBs of src2 and returns the result. Values
int _mpyhsu(int src1, uint src2);             MPYHSU      can be signed or unsigned.
uint _mpyhu(uint src1, uint src2);            MPYHU

int _mpyhl(int src1, int src2);               MPYHL       Multiplies the 16 MSBs of src1 by the 16
int _mpyhuls(uint src1, int src2);            MPYHULS     LSBs of src2 and returns the result. Values
int _mpyhslu(int src1, uint src2);            MPYHSLU     can be signed or unsigned.
uint _mpyhlu(uint src1, uint src2);           MPYHLU

int _mpylh(int src1, int src2);               MPYLH       Multiplies the 16 LSBs of src1 by the 16
int _mpyluhs(uint src1, int src2);            MPYLUHS     MSBs of src2 and returns the result. Values
int _mpylshu(int src1, uint src2);            MPYLSHU     can be signed or unsigned.
uint _mpylhu(uint src1, uint src2);           MPYLHU

void _nassert(int);                                       Generates no code. Tells the optimizer that
                                                          the expression declared with the assert
                                                          function is true; this gives a hint to the
                                                          optimizer as to what optimizations might
                                                          be valid.

uint _norm(int src2);                         NORM        Returns the first nonredundant sign bit of
uint _lnorm(long src2);                                   src2.

int _sadd(int src1, int src2);                SADD        Adds src1 to src2 and saturates the result.
long _lsadd(int src1, long src2);                         Returns the result.

int _sat(long src2);                          SAT         Converts a 40-bit value to a 32-bit value
                                                          and saturates if necessary.

uint _set(uint src2, uint csta, uint cstb);   SET         Sets the specified field in src2 to all 1s
                                                          and returns the initial src2 value. The
                                                          beginning and ending bits of the field to
                                                          be set are specified by csta and cstb,
                                                          respectively.

int _smpy(int src1, int src2);                SMPY        Multiplies src1 by src2, left shifts the
int _smpyh(int src1, int src2);               SMPYH       result by one, and returns the result. If
int _smpyhl(int src1, int src2);              SMPYHL      the result is 0x8000 0000, saturates the
int _smpylh(int src1, int src2);              SMPYLH      result to 0x7FFF FFFF.

uint _sshl(uint src2, uint src1);             SSHL        Shifts src2 left by the contents of src1,
                                                          saturates the result to 32 bits, and
                                                          returns the result.

int _ssub(int src1, int src2);                SSUB        Subtracts src2 from src1, saturates the
long _lssub(int src1, long src2);                         result, and returns the result.

uint _subc(uint src1, uint src2);             SUBC        Performs a conditional subtract divide step
                                                          and returns the result.

int _sub2(int src1, int src2);                SUB2        Subtracts the upper and lower halves of
                                                          src2 from the upper and lower halves of
                                                          src1, and returns the result. Any borrowing
                                                          from the lower half subtract does not
                                                          affect the upper half subtract.
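
The half-word behavior described for ADD2 and SUB2 can be modeled in portable C for host-side checking. This is a sketch based only on the table descriptions. The names pack2, add2_ref, sub2_ref, and add2_demo are hypothetical, and unsigned 32-bit words are used so that wraparound is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word (hi in bits 31..16). */
uint32_t pack2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Model of ADD2: the halves are added independently, so a carry
 * out of the lower half never reaches the upper half. */
uint32_t add2_ref(uint32_t src1, uint32_t src2)
{
    uint32_t lo = (src1 + src2) & 0x0000FFFFu;
    uint32_t hi = ((src1 >> 16) + (src2 >> 16)) << 16;
    return hi | lo;
}

/* Model of SUB2: the halves are subtracted independently, so a
 * borrow from the lower half never reaches the upper half. */
uint32_t sub2_ref(uint32_t src1, uint32_t src2)
{
    uint32_t lo = (src1 - src2) & 0x0000FFFFu;
    uint32_t hi = ((src1 >> 16) - (src2 >> 16)) << 16;
    return hi | lo;
}

/* Usage: 0xFFFF + 1 wraps to 0 in the lower half, while the upper
 * half is simply 1 + 2. */
void add2_demo(void)
{
    assert(add2_ref(pack2(1, 0xFFFF), pack2(2, 1)) == pack2(3, 0));
}
```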


2.5 Memory Dependencies 


To schedule instructions in parallel, the compiler must determine the relation- 
ships, or dependencies, between instructions. A dependency means that one 
instruction must occur before another. Because only independent instructions 
can execute in parallel, dependencies inhibit parallelism. The compiler uses 
these guidelines when scheduling instructions: 


- If the compiler cannot determine that two instructions are independent (for
  example, b does not depend on a), it assumes a dependency and schedules
  the two instructions sequentially.
- If the compiler can determine that two instructions are independent of one
  another, it can schedule them in parallel.
- Often it is difficult for the compiler to determine if instructions that
  access memory are independent.


You can use the following techniques to help the compiler determine which 
instructions are independent: 


- Use the -pm (program-level optimization) option, which gives the compiler
  a global view of the whole program or module and allows it to be more
  aggressive in ruling out dependencies.
- Use the const keyword to indicate which objects are not changed by a
  function.
- Use the -mt compiler option, which allows the compiler to make assumptions
  that allow it to eliminate dependencies.


Example 2—4 shows the C code for a basic vector sum. Figure 2-1 shows a 
dependency graph for the basic vector sum. 


Example 2-4. Basic Vector Sum 


    void vecsum(short *sum, short *in1, short *in2, unsigned int N)
    {
        int i;

        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }


Figure 2-1. Dependency Graph for Vector Sum #1

[Figure: dependency graph with two Load nodes that feed an "Add elements"
node, which in turn feeds a "Store to memory" node. Edge labels in the
original: 5, 5, and 1.]


The dependency graph in Figure 2—1 shows the following: 


- The paths from sum[i] back to in1[i] and in2[i] indicate that writing to sum
  may have an effect on the memory pointed to by either in1 or in2.
- A read from in1 or in2 cannot begin until the write to sum finishes, which
  creates an aliasing problem. Aliasing occurs when two pointers can point
  to the same memory location. For example, if vecsum() is called in a
  program with the following statements, in1 and sum alias each other because
  they both point to the same memory location:


short a[10], b[10]; 
vecsum(a, a, b, 10); 


Although within a single iteration the reads from in1 and in2 finish before
the store to sum, the ’C6x compiler uses software pipelining to execute
multiple iterations in parallel and, therefore, must determine the memory
dependencies that exist across loop iterations.
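
The effect of aliasing across iterations can be seen on any host compiler. The sketch below is an illustration and not part of the ’C6x tools. The function alias_changes_result is a hypothetical name, and the vecsum loop is restated here with an unsigned counter so that the block is self-contained. When sum overlaps in1 and is shifted by one element, each store changes the value that the next iteration loads, so the compiler cannot reorder the loads ahead of the stores without changing the answer.

```c
#include <assert.h>
#include <string.h>

/* Same loop as the basic vector sum. */
void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    unsigned int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}

/* Returns 1 if calling vecsum() with overlapping sum and in1
 * produces a different result than calling it with a private,
 * untouched copy of in1. */
int alias_changes_result(void)
{
    short a[5] = {1, 1, 1, 1, 1};
    short b[4] = {1, 1, 1, 1};
    short a_copy[5];
    short out[4];

    memcpy(a_copy, a, sizeof a);

    /* No aliasing: every load sees the original input values. */
    vecsum(out, a_copy, b, 4);          /* out = {2, 2, 2, 2} */

    /* Aliasing: sum starts one element past in1, so each store
     * feeds the load of the following iteration. */
    vecsum(&a[1], a, b, 4);             /* a[1..4] = {2, 3, 4, 5} */

    assert(out[3] == 2 && a[4] == 5);
    return memcmp(out, &a[1], sizeof out) != 0;
}
```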


To help the compiler, you can qualify an object with the const keyword, which 
indicates that a variable or the memory referenced by a variable will not be 
changed by the function. 


Example 2-5 shows the vecsum() example rewritten with the const keyword
to indicate that the write to sum never changes the memory referenced by in1
and in2. Figure 2-2 shows the revised dependency graph for the code in the
inner loop.


Example 2-5. Vector Sum With const Keywords 


    void vecsum2(short *sum, const short *in1, const short *in2,
                 unsigned int N)
    {
        int i;

        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }


Figure 2-2. Dependency Graph for Vector Sum #2 


Example 2-6 shows the output of the compiler for the vector sum. The 
compiler finds better schedules when dependency paths are eliminated be- 
tween instructions. For this loop, the compiler found a software pipeline with 
a two-cycle kernel (compared with seven for the previous loop). 


Example 2-6. Compiler Output for Vector Sum Code 


    L14:        ; PIPE LOOP KERNEL
               ADD   .L1X   B4,A0,A5
    || [B0]    B     .S2    L14
    ||         LDH   .D1    *A3++,A0
               STH   .D1    A5,*A4++
    || [B0]    SUB   .L2    B0,1,B0
    ||         LDH   .D2    *B5++,B4


Another way to eliminate memory dependencies is to use the -mt option, which
allows the compiler to make assumptions that can eliminate memory
dependency paths. For example, with this option enabled when compiling the
code in Example 2-4, the compiler assumes that in1 and in2 do not alias memory
pointed to by sum and, therefore, eliminates memory dependencies
among the instructions that access those variables.


2.6 Trip Count Issues 


A trip count is the number of times that a loop executes, and the trip counter 
is the variable used to count each iteration. When the trip counter reaches a 
limit equal to the trip count, the loop terminates. The structure of a software 
pipeline requires the execution of a minimum number of loop iterations (a mini- 
mum trip count) in order to fill, or prime, the pipeline. 


Loops that are eligible for software pipelining have loop trip counters that count 
down. In most cases, the compiler can transform the loop to use a trip counter 
that counts down even if the original code was not written that way. 


For example, the optimizer transforms the loop in Example 2—7(a) to some- 
thing like the code in Example 2—7(b): 


Example 2-7. Trip Counters 
(a) Original code

    for (i = 0; i < N; i++)     /* i = trip counter, N = trip count */

(b) Optimized code

    for (i = N; i != 0; i--)    /* Downcounting trip counter */

The minimum trip count for a software pipeline is determined by the number 
of iterations executing in parallel. 


- If the compiler knows the trip count, it can generate faster and more
  compact code.
- If the compiler cannot determine that a loop always executes for the
  minimum trip count, it generates a redundant nonpipelined loop. That
  redundant nonpipelined loop is executed only when the runtime trip count
  is less than the minimum trip count; otherwise, the software-pipelined
  version of the loop is executed.

  In Example 2-5, the compiler cannot determine if the loop always
  executes more than the minimum trip count and, therefore, generates two
  versions of the loop:

  - A nonpipelined version that executes if N is less than the minimum trip
    count.
  - A software-pipelined version that executes if N is equal to or greater
    than the minimum trip count.

  To indicate to the compiler that you do not want two versions of the loop,
  you can use the -ms option so that the compiler generates only the
  software-pipelined code and never generates a redundant loop; however,
  loops with an unknown trip count are not software pipelined.
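
The two-version structure described above can be pictured in C. This is a conceptual sketch and not actual compiler output. MIN_TRIP is a made-up value (the real minimum is determined by the compiler for each loop), the function name is hypothetical, and the second loop is an ordinary downcounting loop that stands in for the software-pipelined version.

```c
#include <assert.h>

#define MIN_TRIP 4u   /* hypothetical minimum trip count */

enum loop_version { NONPIPELINED = 0, PIPELINED = 1 };

/* Runs the vector sum and reports which version of the loop ran. */
enum loop_version vecsum_two_versions(short *sum, const short *in1,
                                      const short *in2, unsigned int N)
{
    unsigned int i;

    assert(sum != 0 && in1 != 0 && in2 != 0);

    if (N < MIN_TRIP) {
        /* Redundant nonpipelined loop: safe for any N, including 0. */
        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
        return NONPIPELINED;
    }

    /* Stand-in for the software-pipelined loop, written with a
     * downcounting trip counter as in Example 2-7(b). */
    for (i = N; i != 0; i--)
        sum[N - i] = in1[N - i] + in2[N - i];
    return PIPELINED;
}
```

Both paths compute the same result. The runtime test on N is the cost that you avoid when the compiler can prove the trip count is large enough.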




Two techniques communicate trip-count information to the compiler: 


- Use the -o3 and -pm (program-level optimization) options to allow the
  optimizer to see the whole program or large parts of it and to characterize
  the behavior of loop trip counts.
- Use the _nassert statement to help reduce code size by preventing the
  generation of a redundant loop or by allowing the compiler (with or without
  the -ms option) to software pipeline innermost loops.


Example 2-8 shows the vector sum code with an _nassert statement. 


Example 2-8. Vector Sum With const Keywords and _nassert 


    void vecsum3(short *sum, const short *in1, const short *in2, unsigned int N)
    {
        int i;

        _nassert(N >= 10);

        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }


The compiler does not generate code for the _nassert() function. _nassert()
simply communicates information to the compiler that helps it determine
the range of a variable. In Example 2-8, _nassert() asserts
that N is always at least 10.

See the TMS320C6x Optimizing C Compiler User’s Guide for a complete
discussion of the -ms, -o3, and -pm options and the _nassert statement.
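
Because _nassert() generates no code, a promise that is violated at run time goes unnoticed on the target. One way to catch such a violation while testing on a host is to map _nassert() to the standard assert() there. The sketch below assumes that the ’C6x compiler predefines the macro _TMS320C6X, as later TI C6000 compilers do; confirm this in your compiler documentation before relying on it. The function name vecsum3_host is hypothetical.

```c
#include <assert.h>

/* Assumption: _TMS320C6X is predefined only by the 'C6x compiler.
 * On any other compiler, map _nassert() to assert() so that a
 * violated promise is caught during host testing. */
#ifndef _TMS320C6X
#define _nassert(cond) assert(cond)
#endif

void vecsum3_host(short *sum, const short *in1, const short *in2,
                  unsigned int N)
{
    unsigned int i;

    _nassert(N >= 10);   /* the caller promises at least 10 elements */

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
```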


2.7 Using Word Access for Short Data 


The ’C62xx has instructions with corresponding intrinsics, such as _add2(),
_mpyhl(), and _mpylh(), that operate on 16-bit data stored in the high and low
parts of a 32-bit register. When operating on a stream of short data, you can
use word (int) accesses to read two short values at a time, and then use
’C62xx intrinsics to operate on the data. For example, rewriting the vecsum()
function to use word accesses (Example 2-9) doubles the performance of the
loop. See Section 4.2, Loading Two Data Values with LDW, for more information.


Example 2-9. Vector Sum With const Keywords, _nassert, Word Reads 


    void vecsum4(short *sum, const short *in1, const short *in2, unsigned int N)
    {
        int i;
        const int *i_in1 = (const int *)in1;
        const int *i_in2 = (const int *)in2;
        int       *i_sum = (int *)sum;

        _nassert(N >= 20);

        for (i = 0; i < (N >> 1); i++)
            i_sum[i] = _add2(i_in1[i], i_in2[i]);
    }


This transformation assumes that the pointers sum, in1, and in2 can be cast 
to int*, which means that they must point to word-aligned data. By default, the 
compiler aligns all short arrays on word boundaries; however, a call like the 
following creates an illegal memory access: 


short a[51], b[50], c[50]; vecsum4(&a[1], b, c, 50); 


Another problem is that the loop must now run for an even number of iterations. 
You can handle this problem by padding the short arrays so that the loop 
always operates on an even number of elements. 


If a vecsum() is needed to handle short-aligned data and odd-numbered loop 
counters, then you must add code within the function to check for these cases. 
Knowing what type of data is passed to a function can improve performance 
considerably. It may be useful to write different functions that can handle 
different types of data. If your short-data operations always operate on even- 
numbered word-aligned arrays, then the performance of your application can 
be improved. However, Example 2—10 provides a generic vecsum() function 
that handles all types of data. 
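
The checks that such a function needs can be factored into small helpers. The following is a host-portable sketch. The names is_word_aligned and can_use_word_access are hypothetical, and the sketch uses uintptr_t instead of an int cast so that it also behaves correctly on hosts where pointers are wider than int.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if p lies on a 4-byte (word) boundary. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}

/* Returns nonzero if the word-access loop can be used as is:
 * all three buffers are word aligned and N is even. */
int can_use_word_access(const short *sum, const short *in1,
                        const short *in2, unsigned int N)
{
    assert(sum != 0 && in1 != 0 && in2 != 0);

    return is_word_aligned(sum) &&
           is_word_aligned(in1) &&
           is_word_aligned(in2) &&
           (N & 1u) == 0;
}
```

A caller can test can_use_word_access() once and then pick either the word-access version or the plain short version of the loop.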
Example 2-10. Vector Sum With const Keywords, _nassert, Word Reads, Generic Version


    void vecsum5(short *sum, const short *in1, const short *in2, unsigned int N)
    {
        int i;

        _nassert(N >= 20);

        if (((int)sum | (int)in2 | (int)in1) & 0x2)
        {
            /* At least one pointer is only short aligned. */
            for (i = 0; i < N; i++)
                sum[i] = in1[i] + in2[i];
        }
        else
        {
            const int *i_in1 = (const int *)in1;
            const int *i_in2 = (const int *)in2;
            int       *i_sum = (int *)sum;

            for (i = 0; i < (N >> 1); i++)
                i_sum[i] = _add2(i_in1[i], i_in2[i]);

            /* Odd N: the last short element is not covered by the */
            /* word loop, so it is summed separately.              */
            if (N & 0x1) sum[N - 1] = in1[N - 1] + in2[N - 1];
        }
    }


Other intrinsics that are useful for reading short data as words are the
multiply intrinsics. Example 2-11 is a dot product example that reads
word-aligned short data and uses the _mpy() and _mpyh() intrinsics:

- The _mpyh() intrinsic uses the ’C62xx instruction MPYH, which multiplies
  the high 16 bits of two registers, giving a 32-bit result.
- Two sum variables are used (sum1 and sum2).

  Using only one sum variable would inhibit parallelism by creating a
  dependency between the write from the first sum calculation and the read
  in the second sum calculation. Within a small loop body, avoid writing to
  the same variable, because doing so can inhibit parallelism and create
  dependencies.


Example 2-11. Dot Product Using Intrinsics 


    int dotprod(const short *a, const short *b, unsigned int N)
    {
        int i, sum1 = 0, sum2 = 0;
        const int *i_a = (const int *)a;
        const int *i_b = (const int *)b;

        for (i = 0; i < (N >> 1); i++)
        {
            sum1 = sum1 + _mpy (i_a[i], i_b[i]);
            sum2 = sum2 + _mpyh(i_a[i], i_b[i]);
        }
        return sum1 + sum2;
    }
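
To check the word-read dot product on a host, the two multiply intrinsics can be modeled in portable C. This is a sketch under stated assumptions: mpy_ref, mpyh_ref, and dotprod_ref are hypothetical names, short is assumed to be 16 bits, and memcpy is used for the word reads so that the host code does not depend on pointer casts or alignment. Because the low and high products are added together, the total does not depend on which half of the word holds which element.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of _mpy(): signed product of the 16 LSBs of each word. */
int32_t mpy_ref(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a & 0xFFFFu) * (int16_t)(b & 0xFFFFu);
}

/* Model of _mpyh(): signed product of the 16 MSBs of each word. */
int32_t mpyh_ref(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a >> 16) * (int16_t)(b >> 16);
}

/* Dot product that reads two shorts per word, with one
 * accumulator per half as in the intrinsic version. */
int32_t dotprod_ref(const short *a, const short *b, unsigned int N)
{
    unsigned int i;
    int32_t sum1 = 0, sum2 = 0;

    assert((N & 1u) == 0);   /* the word version handles pairs only */

    for (i = 0; i < (N >> 1); i++) {
        uint32_t wa, wb;

        memcpy(&wa, &a[2 * i], sizeof wa);   /* word read of a */
        memcpy(&wb, &b[2 * i], sizeof wb);   /* word read of b */
        sum1 += mpy_ref(wa, wb);
        sum2 += mpyh_ref(wa, wb);
    }
    return sum1 + sum2;
}
```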

Example 2-12 shows an FIR filter that uses word reads of short data and
_mpyxx() intrinsics. Example 2-13 shows an optimized version of
Example 2-12. The optimized version passes an int array instead of casting
the short arrays to int arrays and, therefore, helps ensure that data passed to
the function is word aligned. Assuming that a prototype is used, each
invocation of the function ensures that the input arrays are word aligned by
forcing you to insert a cast or by using int arrays that contain short data.



Example 2-12. FIR Filter - Original Form


    void fir1(const short x[], const short h[], short y[], int n, int m, int s)
    {
        int i, j;
        long y0;
        long round = 1L << (s - 1);

        for (j = 0; j < m; j++)


Example 2-13. FIR - Optimized Form


    void fir2(const int x[], const int h[], short y[], int n, int m, int s)
    {
        int i, j;
        long y0, y1;
        long round = 1L << (s - 1);

        for (j = 0; j < (m >> 1); j++)
        {
            y0 = y1 = round;

            for (i = 0; i < (n >> 1); i++)
            {
                y0 += _mpy  (x[i + j],     h[i]);
                y0 += _mpyh (x[i + j],     h[i]);
                y1 += _mpyhl(x[i + j],     h[i]);
                y1 += _mpylh(x[i + j + 1], h[i]);
            }
            *y++ = (int) (y0 >> s);
            *y++ = (int) (y1 >> s);
        }
    }


2.8 Loop Unrolling 


Another technique that can improve performance is unrolling the loop to
increase the number of instructions available to execute in parallel. You can
use loop unrolling when the operations in a single iteration do not use all of
the resources of the ’C62xx architecture.


Example 2-6 (the output of the compiler for the vector sum code in
Example 2-5) shows that the loop produces a new sum[i] every two cycles:

- Three memory operations are being performed: a load for both in1[i] and
  in2[i] and a store for sum[i].
- Because only two memory operations can execute per cycle, two cycles
  are necessary to perform three memory operations.


The performance of a software pipeline is limited by the number of resources 
that can execute in parallel. In its word-aligned form (Example 2-9), the vector 
sum loop delivers two results every two cycles since the two loads and the 
store are all operating on two 16-bit values at a time. 


If you unroll the loop once, the loop then performs six memory operations per
iteration, which means the unrolled vector sum loop can deliver four results
every three cycles (that is, 1.33 results per cycle). Example 2-14 shows four
results for each iteration of the loop: sum[i] and sum[i + sz] each store an int
value that represents two 16-bit values.
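
The arithmetic behind these figures can be written out as a small calculation. The sketch below models only the memory-operation limit that the text describes (two memory operations per cycle); it ignores every other resource, and the function names are hypothetical.

```c
#include <assert.h>

#define MEM_OPS_PER_CYCLE 2u   /* from the text: two memory operations per cycle */

/* Minimum number of kernel cycles needed to issue mem_ops memory
 * operations (division rounded up). */
unsigned int min_kernel_cycles(unsigned int mem_ops)
{
    return (mem_ops + MEM_OPS_PER_CYCLE - 1u) / MEM_OPS_PER_CYCLE;
}

/* Results per cycle, scaled by 100 to stay in integer arithmetic. */
unsigned int results_per_cycle_x100(unsigned int results,
                                    unsigned int mem_ops)
{
    unsigned int cycles = min_kernel_cycles(mem_ops);

    assert(cycles > 0);
    return (results * 100u) / cycles;
}
```

With three memory operations, the short version yields one result in two cycles and the word version yields two results in two cycles. With six memory operations, the unrolled word version yields four results in three cycles, which matches the 1.33 results per cycle quoted above.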


Example 2-14 is not simple loop unrolling in which the loop body is simply
replicated. The additional instructions use memory pointers that are offset to
point midway into the input arrays, and the code assumes that the arrays are
a multiple of four shorts in size.


Example 2-14. Vector Sum With const Keywords, _nassert, Word Reads, and Unrolled 


    void vecsum6(int *sum, const int *in1, const int *in2, unsigned int N)
    {
        int i;
        int sz = N >> 2;

        _nassert(N >= 20);

        for (i = 0; i < sz; i++)
        {
            sum[i]      = _add2(in1[i], in2[i]);
            sum[i + sz] = _add2(in1[i + sz], in2[i + sz]);
        }
    }


Software pipelining is performed only on inner loops; therefore, you can 
increase performance by creating larger inner loops. One method for creating 
large inner loops is to completely unroll inner loops that execute for a small 
number of cycles. 


In Example 2-15, the compiler pipelines the inner loop with a kernel size of one 
cycle; therefore, the inner loop completes a result every cycle. However, the 
overhead of filling and draining the software pipeline can be significant, and 
other outer-loop code is not software pipelined. 


Example 2-15. FIR_Type2 - Original Form


    void fir2(const short input[], const short coefs[], short out[])
    {
        int i, j;
        int sum;

        for (i = 0; i < 40; i++)
        {
            sum = 0;    /* restart the accumulator for each output sample */
            for (j = 0; j < 16; j++)
                sum += coefs[j] * input[i + 15 - j];

            out[i] = (sum >> 15);
        }
    }


For loops with a simple loop structure, the compiler uses a heuristic to deter- 
mine if it should unroll the loop. Since unrolling can increase code size, in some 
cases, the compiler does not unroll the loop. If you have identified this loop as 
being critical to your application, then unroll the inner loop in C code, as in 
Example 2-16. 


Now the outer loop is software pipelined, and the overhead of draining and 
filling the software pipeline occurs only once per invocation of the function 
rather than for each iteration of the outer loop. 


Example 2-16. FIR_Type2 - Inner Loop Completely Unrolled


    void fir2_u(const short input[], const short coefs[], short out[])
    {
        int i;
        int sum;

        for (i = 0; i < 40; i++)
        {
            sum  = coefs[0]  * input[i + 15];
            sum += coefs[1]  * input[i + 14];
            sum += coefs[2]  * input[i + 13];
            sum += coefs[3]  * input[i + 12];
            sum += coefs[4]  * input[i + 11];
            sum += coefs[5]  * input[i + 10];
            sum += coefs[6]  * input[i + 9];
            sum += coefs[7]  * input[i + 8];
            sum += coefs[8]  * input[i + 7];
            sum += coefs[9]  * input[i + 6];
            sum += coefs[10] * input[i + 5];
            sum += coefs[11] * input[i + 4];
            sum += coefs[12] * input[i + 3];
            sum += coefs[13] * input[i + 2];
            sum += coefs[14] * input[i + 1];
            sum += coefs[15] * input[i + 0];
            out[i] = (sum >> 15);
        }
    }
2.9 What Disqualifies a Loop From Being Software Pipelined


In a sequence of nested loops, the innermost loop is the only one that can be 
software pipelined. 


The following restrictions apply to the software pipelining of loops: 


- Although a software-pipelined loop can contain intrinsics, it cannot contain
  function calls.
- The loop cannot have a conditional break (early exit).
- The loop must have a loop counter that counts down and that terminates
  at 0. One reason to run the optimizer with the -o2 or -o3 option is
  to convert as many loops as possible into downcounting loops.
- If the trip counter is modified within the body of the loop, it typically
  cannot be converted into a downcounting loop. For example, the following
  code is not software pipelined:

      for (i = 0; i < n; i++)
          i += x;

- A loop with a conditionally incremented loop control variable is not
  software pipelined. The following is an example that would not be
  pipelined:

      for (i = 0; i < x; i++)

- The code size is too large and requires more than the 32 registers in the
  ’C62xx.
- A register value is "live-too-long." See Section 4.8, Live-too-Long Issues.
- The loop has complex condition code within the body that requires more
  than the five ’C62xx condition registers.
Chapter 3


Structure of Assembly Code 


An assembly language program must be an ASCII text file. Any line of 
assembly code can include up to six items: 


- Labels
- Conditions
- Instructions
- Units
- Operands
- Comments

3.1 Labels
3.2 Parallel Bars
3.3 Conditions
3.4 Instructions
3.5 Units
3.6 Operands
3.7 Comments



3.1 Labels 


Labels identify a line of code or a variable and represent memory addresses 
that contain either an instruction or data. 


Figure 3-1 shows the position of the label in a line of assembly code. The
colon following the label is optional.


Figure 3-1. Labels in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


Labels must meet the following conditions: 


- The initial character of a label must be a letter.
- The first character of the label must be in the first column of the text file.
- Labels can include up to 32 alphanumeric characters.


3.2 Parallel Bars 


Figure 3-2. Parallel Bars in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments


Instructions that execute in parallel with the previous instruction signify this 
with parallel bars (||). This field is left blank for instructions that do not execute 
in parallel with the previous instruction. 


3.3 Conditions 


Five registers in the ’C62xx are available for conditions: A1, A2, B0, B1, and
B2. Figure 3-3 shows the position of a condition in a line of assembly code.


Figure 3-3. Conditions in Assembly Code


label: parallel bars [condition] instruction unit operands ; comments


All ’C62xx instructions are conditional:


- If no condition is specified, the instruction is always performed.

- If a condition is specified and that condition is true, the instruction
  executes. For example:

  With this condition ...    The instruction executes if ...
  [A1]                       A1 != 0
  [!A1]                      A1 = 0

- If a condition is specified and that condition is false, the instruction does
  not execute. For example:

  With this condition ...    The instruction does not execute if ...
  [A1]                       A1 = 0
  [!A1]                      A1 != 0
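
The condition rules above can be summarized in a few lines of C. This is a minimal sketch of the semantics only; the function name and its parameters are illustrative and are not part of any ’C62xx tool.

```c
/* Models the [reg] and [!reg] condition forms.
 * has_cond:  0 if the instruction carries no condition field
 * negate:    1 for the [!reg] form, 0 for the [reg] form
 * reg_value: current contents of the condition register (for example A1)
 * Returns 1 if the instruction executes, 0 if it does not. */
int instruction_executes(int has_cond, int negate, int reg_value)
{
    if (!has_cond)
        return 1;                 /* no condition: always performed */
    if (negate)
        return reg_value == 0;    /* [!A1] executes when A1 = 0 */
    return reg_value != 0;        /* [A1] executes when A1 != 0 */
}
```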
3.4 Instructions


Assembly code instructions are either directives or mnemonics:


- Assembler directives are commands for the assembler (asm6x) that
  control the assembly process or define the data structures (constants and
  variables) in the assembly language program. All assembler directives
  begin with a period, as shown in the list in Table 3-1.

- Processor mnemonics are the actual microprocessor instructions that
  execute at runtime and perform the operations in the program. Table 3-2
  summarizes the ’C62xx mnemonics. Processor mnemonics must begin in
  column 2 or greater.


Figure 3—4 shows the position of the instruction in a line of assembly code. 


Figure 3—4. Instructions in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


Table 3-1. ’C62xx Directives

Directives        Description

.sect "name"      Creates section of information (data or code)

.int value        Reserve 32 bits in memory and fill with specified value
.long value
.word value

.short value      Reserve 16 bits in memory and fill with specified value
.half value

.byte value       Reserve 8 bits in memory and fill with specified value


Table 3-2. ’C62xx Mnemonics

Arithmetic        ABS, ADD, ADDA, ADDK, ADD2, SADD, SAT, SSUB, SUB, SUBA, SUBC, SUB2
Multiply          MPY, MPYH, MPYHL, MPYLH, SMPY
Load/Store        LD, MVK, MVKH, ST, STP
Program Control   B, B IRP, B NRP
Bit Management    CLR, EXT, LMBD, NORM, SET
Logical           AND, CMPEQ, CMPGT, CMPLT, OR, SHL, SHR, SSHL, XOR
Pseudo/Other


3.5 Units 


The ’C62xx CPU contains eight functional units, which are shown in 
Figure 3-5. 


Table 3-3. Functional Units and Descriptions

Functional Unit       Description

.L unit (.L1, .L2)    32/40-bit arithmetic and compare operations
                      Leftmost 1 or 0 bit counting for 32 bits
                      Normalization count for 32 and 40 bits
                      32-bit logical operations

.S unit (.S1, .S2)    32-bit arithmetic operations
                      32/40-bit shifts and 32-bit bit-field operations
                      32-bit logical operations
                      Branching
                      Constant generation
                      Register transfers to/from the control register file

.M unit (.M1, .M2)    16 x 16 bit multiplies

.D unit (.D1, .D2)    32-bit add, subtract, linear and circular address calculation


Figure 3-5. ’C62xx Functional Units


Figure 3-6 shows the position of the unit in a line of assembly code. 


Figure 3-6. Units in the Assembly Code


label: parallel bars [condition] instruction unit operands ; comments


Specifying the functional unit in the assembly code is optional. The functional 
unit can be used to document which resource(s) each instruction uses. 
3.6 Operands


The ’C62xx architecture requires that memory reads and writes move data 
between memory and a register. Figure 3—7 shows the position of the oper- 
ands in a line of assembly code. 


Figure 3—7. Operands in the Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments


Instructions have the following requirements for operands in the assembly
code:


- All instructions require a destination operand.
- Most instructions require one or two source operands.
- Destination operands must be on the same side as one source.
- One source operand per execute packet can come from the other side.


When an operand comes from the other register file, the unit includes an X 
as shown in Figure 3-8. 


Figure 3-8. Operands in Instructions


        ADD   .L1    A0,A1,A3

        ADD   .L1X   A0,B1,A3


All registers except B1 are on the same side of the CPU.


The ’C62xx instructions use three types of operands to access data:


- Register operands indicate a register that contains the data.
- Constant operands specify the data within the assembly code.
- Pointer operands contain addresses of data values.


Only the load and store instructions require and use pointer operands to 
move data values between memory and a register. 




3.7 Comments 


As with all programming languages, comments provide code documentation.
Figure 3-9 shows the position of the comment in a line of assembly code.


Figure 3-9. Comments in Assembly Code


label: parallel bars [condition] instruction unit operands ; comments


The following are guidelines for using comments in assembly code:


- A comment may begin in any column when preceded by a semicolon (;).
- A comment must begin in the first column when preceded by an asterisk (*).
- Comments are not required, but they are recommended.
Chapter 4


Optimizing Assembly Code 


This chapter describes methods that help to develop more efficient assembly 
language programs. The primary purpose of this chapter is to help you under- 
stand the code produced by the assembly optimizer and to help you perform 
manual optimization. 


4.1 Writing Parallel Code 


One way to optimize assembly code is to reduce the number of execution 
cycles in a loop. You can do this by rewriting serial instructions so that they 
execute in parallel. 


4.1.1. Dot-Product C Code 


The C code in Example 4—1 represents a dot-product algorithm that includes 
the following operations: 


- Multiply each element in array a by the corresponding element in array b
- Accumulate each product in sum


Example 4-1. Dot-Product C Code


int dotp(short a[], short b[])
{
    int sum, i;
    sum = 0;

    for (i=0; i<100; i++)
        sum += a[i] * b[i];

    return (sum);
}


4.1.2 Translating C Code to ’C62xx Instructions 


Example 4—2 shows the translation of the C code to ’C62xx instructions and 
illustrates the following decisions that affect the assignment of units: 


- The load halfword (LDH) instructions increment through the a and b arrays.
  Each LDH does a post increment on the pointer and, therefore, on
  each iteration sets the pointer to the next halfword (16 bits) in the array.
  The load instructions must use a .D unit.

- All multiply (MPY) instructions must use a .M unit.

- The ADD instruction accumulates the total of the results from the multiply
  (MPY) instruction.

- The subtract (SUB) instruction decrements the loop counter.

- The branch (B) instruction is conditional on the loop counter, A1, and
  executes only until A1 is 0. Branch (B) instructions must use a .S unit.


Example 4-2. Translating C Code to ’C62xx Instructions


(a) C code


int dotp(short a[], short b[])
{
    int sum, i;
    sum = 0;

    for (i=0; i<100; i++)
        sum += a[i] * b[i];

    return (sum);
}


(b) ’C62xx instructions


        LDH   .D1       ; load ai from memory
        LDH   .D1       ; load bi from memory
        MPY   .M1       ; ai * bi
        ADD   .L1       ; sum += (ai * bi)
        SUB   .S1       ; decrement loop counter
 [A1]   B     .S2       ; branch to loop
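
To make the correspondence between parts (a) and (b) easier to follow, the C function can be restructured so that each statement maps to one instruction. The sketch below is illustrative only: the name dotp_stepwise and the local variable names are not from the original code, and it assumes, as in Example 4-1, that both arrays hold 100 elements.

```c
/* Same result as dotp() in Example 4-1, written one operation per line. */
int dotp_stepwise(const short *a, const short *b)
{
    int sum = 0;
    int ctr = 100;              /* loop counter (held in A1 in later examples) */

    do {
        short ai = *a++;        /* LDH: load ai, post-increment the pointer */
        short bi = *b++;        /* LDH: load bi, post-increment the pointer */
        int   pi = ai * bi;     /* MPY: 16 x 16 bit multiply */
        sum += pi;              /* ADD: accumulate the product */
        ctr -= 1;               /* SUB: decrement the loop counter */
    } while (ctr != 0);         /* [A1] B: branch while the counter is nonzero */

    return sum;
}
```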


4.1.3 Drawing a Dependency Graph 


Dependency graphs can help analyze loops by showing the flow of instructions
and data in an algorithm. These graphs also show how instructions
depend on one another. The following terms are used in defining a dependency
graph.


- A node is a point on a dependency graph with one or more data paths
  flowing in and/or out.

- The path shows the flow of data between nodes. The numbers beside
  each path represent the number of cycles required to complete the instruction.

- An instruction that writes to a variable is referred to as a parent instruction
  and defines a parent node.

- An instruction that reads a variable written by a parent instruction is
  referred to as its child and defines a child node.


Use the following steps to draw a dependency graph:


1) Define the nodes based on the variables accessed by the instructions.
2) Define the data paths that show the flow of data between nodes.
3) Add the instructions and the latencies.
4) Add the functional units.


Figure 4-1 shows the dependency graph for the dot-product assembly
instructions:


- The two LDH instructions, which write the values of ai and bi, are parents
  of the MPY instruction.

- The MPY instruction, which writes the product, pi, is the parent of the ADD
  instruction.

- The ADD instruction adds pi, the result of the MPY, to sum. The output of
  the ADD instruction feeds back to become an input on the next iteration
  and, thus, creates a loop carry path.


The dependency graph for the dot-product algorithm has two separate graphs
because the decrement of the loop counter and the branch do not read or write
any variables from the other graph.


- The SUB instruction writes to the loop counter, ctr. The output of the SUB
  instruction feeds back and creates a loop carry path in this graph.

- The branch (B) instruction is a child of the loop counter.
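
The latencies on the paths of the graph determine how long one iteration takes from the first load to the completed add. The following C sketch adds them up along the LDH, MPY, ADD path. It assumes the latencies used elsewhere in this chapter (five cycles for a load, two for a multiply, one for an add); the function and constant names are illustrative.

```c
/* Latencies in cycles, as used in the examples in this chapter. */
enum { LAT_LDH = 5, LAT_MPY = 2, LAT_ADD = 1 };

static int max_int(int x, int y) { return x > y ? x : y; }

/* Number of cycles from cycle 0 until the ADD has completed.
 * start_a and start_b are the cycles on which the two loads issue. */
int dotp_path_cycles(int start_a, int start_b)
{
    int ai_ready = start_a + LAT_LDH;
    int bi_ready = start_b + LAT_LDH;
    /* The MPY is a child of both loads, so it waits for the later one. */
    int pi_ready = max_int(ai_ready, bi_ready) + LAT_MPY;
    return pi_ready + LAT_ADD;
}
```

With both loads issued on cycle 0 the path takes eight cycles; issuing the second load one cycle later, as serial code does, stretches it to nine.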


Figure 4-1. Dependency Graph for Dot Product


Example 4-3. Serial Assembly


        LDH   .D1   *A4++,A2    ; load ai from memory
        LDH   .D1   *A3++,A5    ; load bi from memory
        MPY   .M1   A2,A5,A6    ; ai * bi
        ADD   .L1   A6,A7,A7    ; sum += (ai * bi)
        SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to loop


4.1.4 Serial vs. Parallel Code


Example 4—4 shows an example of a dot-product loop written serially. The 
NOP instructions allow for the delay slots of the LDH, MPY, and branch 
instructions. 


Executing this dot-product code serially requires 16 cycles for each iteration 
plus two cycles to set up the loop counter and initialize the accumulator; 
100 iterations require 1602 cycles. 


Example 4-4. Dot-Product Serial Assembly


        MVK   .S1   100, A1     ; set up loop counter
        ZERO  .L1   A7          ; zero out accumulator
LOOP:
        LDH   .D1   *A4++,A2    ; load ai from memory
        LDH   .D1   *A3++,A5    ; load bi from memory
        NOP   4                 ; delay slots for LDH
        MPY   .M1   A2,A5,A6    ; ai * bi
        NOP                     ; delay slot for MPY
        ADD   .L1   A6,A7,A7    ; sum += (ai * bi)
        SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to loop
        NOP   5                 ; delay slots for branch
                                ; Branch occurs here

To help improve the performance of this loop, you can assign the functional
units to execute the code in parallel, as shown in the dependency graph in
Figure 4-2.


- Since the loads of ai and bi do not depend on one another, both LDH
  instructions can execute in parallel as long as they do not share the same
  resources. To accomplish this, you can allocate the following functional
  units:

  - ai and the pointer to a to a functional unit on the A side, .D1
  - bi and the pointer to b to a functional unit on the B side, .D2

- Because the MPY instruction now has one source operand from A and one
  from B, the MPY uses the 1X cross path.

- The SUB instruction can take the place of one of the NOP delay slots for
  the LDH instructions.

- Moving the B instruction after the SUB removes the need for the NOP 5
  at the end of the code in Example 4-4. The branch now occurs immediately
  after the ADD instruction, so that the MPY and ADD execute in
  parallel with the five delay slots required by the branch instruction.




Executing this dot-product code in Example 4—5 requires eight cycles for each 
iteration plus one cycle to set up the loop counter and initialize the accumula- 
tor; 100 iterations require 801 cycles. 


Figure 4-2. Dependency Graph for Parallel Assembly


Example 4-5. Dot-Product Parallel Assembly


        MVK   .S1   100, A1     ; set up loop counter
||      ZERO  .L1   A7          ; zero out accumulator
LOOP:
        LDH   .D1   *A4++,A2    ; load ai from memory
||      LDH   .D2   *B4++,B2    ; load bi from memory
        SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to loop
        NOP   2                 ; delay slots for LDH
        MPY   .M1X  A2,B2,A6    ; ai * bi
        NOP                     ; delay slots for MPY
        ADD   .L1   A6,A7,A7    ; sum += (ai * bi)
                                ; Branch occurs here


4.1.5 Comparing Performance of Serial and Parallel Code 


Table 4—1 compares the performance of the serial code with the parallel code. 


Table 4-1. Comparison of Serial and Parallel Code


Code Example                         100 Iterations    Cycle Count
Example 4-4   Serial Assembly        2 + 100 x 16      1602
Example 4-5   Parallel Assembly      1 + 100 x 8       801
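
The cycle counts in Table 4-1 can be checked by adding up the cycles that each execute packet of the loop body occupies and multiplying by the iteration count. The following C sketch does that; the function names are illustrative, and the per-packet cycle lists are read off Examples 4-4 and 4-5 (a NOP n occupies n cycles).

```c
#include <stddef.h>

/* Sum of the cycles occupied by each execute packet in one iteration. */
int iteration_cycles(const int *packet_cycles, size_t n)
{
    int total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += packet_cycles[i];
    return total;
}

/* Setup cycles plus the cycles for all iterations of the loop. */
int loop_total_cycles(int setup, int iterations, int per_iteration)
{
    return setup + iterations * per_iteration;
}
```

For the serial loop the packets occupy 1, 1, 4, 1, 1, 1, 1, 1, and 5 cycles (16 in total); for the parallel loop they occupy 1, 1, 1, 2, 1, 1, and 1 cycles (8 in total).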
4.2 Loading Two Data Values With LDW


In writing the parallel code in section 4.1, you used an LDH instruction to read
a[i]. Because a[i] and a[i + 1] are next to each other in memory, you can use
the load word (LDW) instruction to read a[i] and a[i + 1] at the same time and
to load both into a single 32-bit register.


4.2.1 Unrolled Dot-Product C Code


The C code in Example 4-6 has the effect of unrolling the loop by accumulating
the even elements, a[i] and b[i], into sum0 and the odd elements, a[i + 1]
and b[i + 1], into sum1. After the loop, sum0 and sum1 are added to produce
the final sum.


Example 4-6. Unrolled Dot-Product C Code


int dotp(short a[], short b[])
{
    int sum0, sum1, sum, i;

    sum0 = 0;
    sum1 = 0;
    for (i=0; i<100; i+=2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    sum = sum0 + sum1;
    return (sum);
}


4.2.2 Translating the Unrolled C Code to ’C62xx Instructions 


Example 4—7 shows the ’C62xx instructions that execute the unrolled loop. 


- The two load word (LDW) instructions load a[i], a[i + 1], b[i], and b[i + 1]
  on each iteration.

- Two MPY instructions are now necessary to multiply the second set of
  array elements:

  - The first MPY instruction, which is the same as the one in
    Example 4-7, multiplies the 16 least significant bits (LSBs) in each
    source register: a[i] x b[i].

  - The MPYH instruction multiplies the 16 most significant bits (MSBs) of
    each source register: a[i + 1] x b[i + 1].

  Note:

  This is only true for the case when the ’C62xx is in little-endian mode. In
  big-endian mode, MPY operates on a[i+1] and b[i+1] and MPYH operates on a[i]
  and b[i]. Refer to the TMS320C62xx Peripherals Reference Guide for more
  information.

- The two ADD instructions accumulate the sums of the even and odd
  elements: sum0 and sum1.


Example 4-7. List of Symbolic Dot-Product Instructions


        LDW     *aptr++,ai_i+1       ; load ai and ai+1 from memory
        LDW     *bptr++,bi_i+1       ; load bi and bi+1 from memory
        MPY     ai_i+1,bi_i+1,pi     ; ai * bi
        MPYH    ai_i+1,bi_i+1,pi+1   ; ai+1 * bi+1
        ADD     pi,sum0,sum0         ; sum0 += (ai * bi)
        ADD     pi+1,sum1,sum1       ; sum1 += (ai+1 * bi+1)
        SUB     cntr,1,cntr          ; decrement loop counter
 [cntr] B       LOOP                 ; branch to loop
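
The effect of LDW, MPY, and MPYH on packed halfwords can be modeled in C. The sketch below assumes little-endian mode, as in the note above, so the element at the lower address lands in the 16 LSBs. The function names load_word, mpy, mpyh, and dotp_ldw are illustrative stand-ins for the instructions, not compiler intrinsics.

```c
#include <stdint.h>

/* Interpret the low 16 bits of v as a signed 16-bit value. */
static int32_t sign_extend16(uint32_t v)
{
    int32_t x = (int32_t)(v & 0xFFFFu);
    return (x >= 0x8000) ? x - 0x10000 : x;
}

/* LDW in little-endian mode: p[0] in the 16 LSBs, p[1] in the 16 MSBs. */
uint32_t load_word(const int16_t *p)
{
    return ((uint32_t)(uint16_t)p[1] << 16) | (uint16_t)p[0];
}

/* MPY: signed product of the 16 LSBs of each source. */
int32_t mpy(uint32_t x, uint32_t y)
{
    return sign_extend16(x) * sign_extend16(y);
}

/* MPYH: signed product of the 16 MSBs of each source. */
int32_t mpyh(uint32_t x, uint32_t y)
{
    return sign_extend16(x >> 16) * sign_extend16(y >> 16);
}

/* Unrolled dot product over 100 elements, two elements per word load. */
int dotp_ldw(const int16_t *a, const int16_t *b)
{
    int32_t sum0 = 0, sum1 = 0;
    int i;

    for (i = 0; i < 100; i += 2) {
        uint32_t aw = load_word(&a[i]);   /* ai and ai+1 */
        uint32_t bw = load_word(&b[i]);   /* bi and bi+1 */
        sum0 += mpy(aw, bw);              /* even elements */
        sum1 += mpyh(aw, bw);             /* odd elements */
    }
    return sum0 + sum1;
}
```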
4.2.3 Drawing a Dependency Graph for the Unrolled Loop


Figure 4-3 shows the dependency graph for the unrolled loop:


- The LDW instructions are parents of the MPY instructions.
- The MPY instructions are parents of the ADD instructions.


To split the graph between the A and B registers, follow these guidelines:


- Place an equal number of LDWs, MPYs, and ADDs on each side.

- To keep both sides even, place the remaining two instructions, branch and
  SUB, on opposite sides.


Figure 4-3. Dependency Graph of Dot Product With LDW
4.2.4 Allocating Resources


After splitting the dependency graph, you can allocate functional units and
registers, as shown in the dependency graph in Figure 4-4 and the instructions
in Example 4-8. The .M1X and .M2X represent a path in the dependency
graph crossing from one side to the other.


Figure 4-4. Dependency Graph of Dot Product With LDW


Example 4-8. Dot Product Instructions With LDW


        LDW   .D1   *A4++,A2    ; load ai and ai+1 from memory
        LDW   .D2   *B4++,B2    ; load bi and bi+1 from memory
        MPY   .M1X  A2,B2,A6    ; ai * bi
        MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
        ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
        SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to loop
4.2.5 Adding the Setup Code


Example 4-9 shows the assembly code for the unrolled loop, using LDW
instructions instead of LDH instructions.


- The setup code assumes that A4 and B4 have been initialized to point to
  arrays a and b, respectively.

- The MVK instruction initializes the loop counter.

- The two ZERO instructions, which execute in parallel, initialize the even
  and odd accumulators (sum0 and sum1) to 0.

- The third ADD instruction adds the even and odd accumulators.


Executing the dot-product code with the optimizations in Example 4-9
requires only 50 iterations, because you operate in parallel on both the even
and odd array elements. With the setup code and the final ADD instruction,
processing all 100 array elements requires a total of 402 cycles (1 + 8 x 50 + 1).


Example 4-9. Dot-Product Assembly With LDW


        MVK   .S1   50,A1       ; set up loop counter
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator
LOOP:
        LDW   .D1   *A4++,A2    ; load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ; load bi & bi+1 from memory
        SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to loop
        NOP   2
        MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
        NOP
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
                                ; Branch occurs here
        ADD   .L1X  A7,B7,A4    ; sum = sum0 + sum1
4.2.6 Comparing Performance With Use of LDW


Table 4-2 compares the performance of the different versions of the
dot-product code.


Table 4-2. Comparison With Use of LDW


Code Example                                      100 Iterations     Cycle Count
Example 4-4   Dot-Product Serial Assembly         2 + 100 x 16       1602
Example 4-5   Dot-Product Parallel Assembly       1 + 100 x 8        801
Example 4-9   Dot-Product Assembly With LDW       1 + (50 x 8) + 1   402
4.3 Software Pipelining


Software pipelining is a technique used to schedule instructions from a loop
so that multiple iterations execute in parallel. The parallel resources on the
’C62xx make it possible to initiate a loop iteration before previous iterations
finish. The goal of software pipelining is to start a new loop iteration as soon
as possible.


The dot-product code in Example 4-9 needed eight cycles for one iteration of
the loop: five cycles for the LDWs, two cycles for the MPYs, and one cycle for
the ADDs. With software pipelining, after the first eight cycles you start and
finish an iteration every cycle.


Figure 4-5 shows the dependency graph for the dot-product instructions.
Example 4-10 shows the dot-product instructions from Example 4-8 with the
SUB instruction now conditional on A1.
Figure 4-5. Dot-Product Instructions With Conditional SUB Instruction


Example 4-10. Dot Product Instructions With Functional Units


        LDW   .D1   *A4++,A2    ; load ai and ai+1 from memory
        LDW   .D2   *B4++,B2    ; load bi and bi+1 from memory
        MPY   .M1X  A2,B2,A6    ; ai * bi
        MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
        ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
 [A1]   SUB   .S1   A1,1,A1     ; decrement loop counter
 [A1]   B     .S2   LOOP        ; branch to top of loop
4.3.1 Using the Modulo Iteration Interval Table


The iteration interval of a loop is the number of cycles between the initiations
of successive iterations of that loop.


The modulo iteration interval scheduling table shows how a software-pipelined
loop executes and keeps track of resources on a cycle-by-cycle basis to
ensure that no resource is used twice on any given cycle.


Table 4-3 shows a modulo iteration interval table for the dot-product loop
before software pipelining (Example 4-9):


- Each row represents a functional unit.
- The columns indicate what is executing on a particular cycle.
- In this example, each unit is used only once every eight cycles:

  - LDWs on the .D units are issued on cycles 0, 8, 16, 24, etc.
  - MPY and MPYH on the .M units are issued on cycles 5, 13, 21, 29, etc.
  - ADDs on the .L units are issued on cycles 7, 15, 23, 31, etc.
  - SUB on the .S1 unit is issued on cycles 1, 9, 17, 25, etc.
  - B on the .S2 unit is issued on cycles 2, 10, 18, 26, etc.


Table 4-3. Dot-Product Modulo Iteration Interval Table


Unit\Cycle   0      1      2      3      4      5      6      7
.D1          LDW
.D2          LDW
.M1                                             MPY
.M2                                             MPYH
.L1                                                           ADD
.L2                                                           ADD
.S1                 SUB
.S2                        B
4.3.2 Determining the Minimum Iteration Interval


The minimum iteration interval of a loop is the minimum number of cycles you
must wait between each initiation of successive iterations of that loop. The
smaller the iteration interval, the fewer cycles it takes to execute a loop.


Resources and data dependency constraints determine the minimum iteration
interval.


- The most-used resource constrains the minimum iteration interval. For
  example, if four instructions in a loop all use the .S1 unit, the minimum
  iteration interval is at least 4: four instructions using the same resource cannot
  execute in parallel and, therefore, require at least four separate cycles to
  execute each instruction.

- With the SUB and branch instructions on opposite sides of the dependency
  graph in Figure 4-5, all eight instructions use a different functional
  unit and no two instructions use the same cross paths (1X and 2X).
  Because no two instructions use the same resource, the minimum iteration
  interval based on resources is 1.

- In the dot-product instructions, no data dependencies affect the minimum
  iteration interval.
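
The resource constraint described above amounts to finding the most heavily used functional unit. The following C sketch computes that bound. It is a simplification: it counts functional units only, does not model the cross paths, and ignores data dependencies; the names and the order of units in the array are illustrative.

```c
enum { NUM_UNITS = 8 };   /* .L1 .L2 .S1 .S2 .M1 .M2 .D1 .D2 */

/* uses[u] is the number of instructions in one loop iteration that are
 * assigned to unit u. Returns the resource-based lower bound on the
 * iteration interval: the largest per-unit count, and never less than 1. */
int resource_min_interval(const int uses[NUM_UNITS])
{
    int u;
    int bound = 1;

    for (u = 0; u < NUM_UNITS; u++) {
        if (uses[u] > bound)
            bound = uses[u];
    }
    return bound;
}
```

For the dot-product loop every unit is used once, so the bound is 1; a loop with four instructions on .S1 gives a bound of 4.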
4.3.3 Creating a Fully Pipelined Schedule


Having determined that the minimum iteration interval is 1, you can initiate a
new iteration every cycle. You can schedule LDW and MPY instructions on every
cycle. Table 4-4 shows a fully pipelined schedule for the dot-product example:


- The right-most column is a single-cycle loop that contains the entire loop.
- Cycles 0-6 are loop setup code, or loop prologue.


Asterisks define which iteration of the loop the instruction is executing each
cycle. For example, the right-most column shows that on any given cycle
inside the loop:


- The ADD instructions are adding data for iteration n.
- The MPY instructions are multiplying data for iteration n + 2 (**).
- The LDW instructions are loading data for iteration n + 7 (*******).
- The SUB instruction is executing for iteration n + 6 (******).
- The branch instruction is executing for iteration n + 5 (*****).


In this case, multiple iterations of the loop execute in parallel in a software
pipeline that is eight iterations deep, with iterations n and n + 7 executing in
parallel. Software pipelines are rarely deeper than the one created by this
single-cycle loop. As loop sizes grow, the number of iterations that can execute
in parallel tends to decrease.
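
The iteration offsets listed above follow from the cycle on which each instruction is issued within its own iteration (0 for LDW, 1 for SUB, 2 for B, 5 for MPY, 7 for ADD, as in section 4.3.1). The C sketch below shows the arithmetic for an iteration interval of 1; the function names are illustrative.

```c
/* With an iteration interval of 1, iteration k starts on cycle k, so an
 * instruction issued `offset` cycles into its iteration is, on any cycle
 * inside the loop, working (last_offset - offset) iterations ahead of the
 * instruction that is issued last (the ADD, at last_offset = 7). */
int iterations_ahead(int offset, int last_offset)
{
    return last_offset - offset;
}

/* Number of iterations in flight at once in the single-cycle loop. */
int pipeline_depth(int last_offset)
{
    return last_offset + 1;
}
```

The result of iterations_ahead is also the number of asterisks shown for that instruction in the loop column.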


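Table 4-4. Dot-Product Modulo Iteration Interval Table


Unit\Cycle   0     1      2       3        4         5          6           7 (loop)
.L1                                                                         ADD
.L2                                                                         ADD
.M1                                                  MPY        MPY*        MPY**
.M2                                                  MPYH       MPYH*       MPYH**
.D1          LDW   LDW*   LDW**   LDW***   LDW****   LDW*****   LDW******   LDW*******
.D2          LDW   LDW*   LDW**   LDW***   LDW****   LDW*****   LDW******   LDW*******
.S1                SUB    SUB*    SUB**    SUB***    SUB****    SUB*****    SUB******
.S2                       B       B*       B**       B***       B****       B*****


Note: The asterisks indicate the iteration of the loop; the right-most column (cycle 7) is the single-cycle loop.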
4.3.4 Software Pipelined Dot Product


The following describes Example 4-11, which shows the assembly code for
Table 4-4.


- The accumulators are initialized to 0 and the loop counter is set up in the
  first execute packet in parallel with the first LDW instructions.

- Asterisks in the comments indicate the iterations that execute like those
  in Table 4-4.

- The branch target is the execute packet defined by the label LOOP, and
  multiple branch instructions are in the pipe.

  - The first branch is issued on cycle 2, but does not actually branch until
    the end of cycle 7 after five delay slots. The second branch is issued
    on cycle 3, but does not branch until cycle 8.

  - On cycle 7, the first branch returns to the same execute packet,
    resulting in a single-cycle loop.

  - On every cycle after cycle 7, a branch executes back to LOOP until the
    loop counter finally decrements to 0.

  - Once the loop counter is 0, five more branches execute because they
    are already in the pipe.


If the SUB instruction were not conditional on A1, you would have an infinite
loop:


- As the loop executes five more times, the loop counter, if not conditional,
  would become a negative number on the next decrement.

- When A1 is negative, it is nonzero and, therefore, causes the condition on
  the branch to be true again.

- Making the SUB instruction conditional on A1 ensures that A1 stops
  decrementing when it reaches 0.
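
The difference between a conditional and an unconditional SUB can be seen in a short simulation. The C sketch below is a simplified model, not a cycle-accurate one: it runs one SUB and one [A1] B per cycle, ignores the branch delay slots and the prologue, and assumes that both instructions read A1 before the SUB writes it. The function name is illustrative.

```c
/* Counts how many branches are issued over `cycles` cycles when the loop
 * counter starts at `count`. sub_is_conditional selects [A1] SUB (1)
 * or a plain SUB (0). */
int branches_issued(int count, int cycles, int sub_is_conditional)
{
    int a1 = count;
    int issued = 0;
    int c;

    for (c = 0; c < cycles; c++) {
        int cond = (a1 != 0);     /* value of A1 seen by this execute packet */

        if (cond)
            issued++;             /* [A1] B LOOP */
        if (cond || !sub_is_conditional)
            a1--;                 /* [A1] SUB, or SUB without a condition */
    }
    return issued;
}
```

With a counter of 50 and 60 simulated cycles, the conditional SUB stops the branches after 50. The unconditional SUB misses only the one cycle on which A1 is exactly 0, then keeps branching because A1 has gone negative.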


Executing the dot-product code with the software pipelining as shown in 
Example 4—11 requires a total of 58 cycles (7 + 50 + 1), which is a significant 
improvement over the 402 cycles required by the code in Example 4-9. 
Example 4-11. Software Pipelined Dot Product


        LDW   .D1   *A4++,A2    ; load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ; load bi & bi+1 from memory
||      MVK   .S1   50,A1       ; set up loop counter
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator

   [A1] SUB   .S1   A1,1,A1     ; decrement loop counter
||      LDW   .D1   *A4++,A2    ;* load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;* load bi & bi+1 from memory

   [A1] SUB   .S1   A1,1,A1     ;* decrement loop counter
|| [A1] B     .S2   LOOP        ; branch to loop
||      LDW   .D1   *A4++,A2    ;** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;** load bi & bi+1 from memory

   [A1] SUB   .S1   A1,1,A1     ;** decrement loop counter
|| [A1] B     .S2   LOOP        ;* branch to loop
||      LDW   .D1   *A4++,A2    ;*** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;*** load bi & bi+1 from memory

   [A1] SUB   .S1   A1,1,A1     ;*** decrement loop counter
|| [A1] B     .S2   LOOP        ;** branch to loop
||      LDW   .D1   *A4++,A2    ;**** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;**** load bi & bi+1 from memory

        MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;**** decrement loop counter
|| [A1] B     .S2   LOOP        ;*** branch to loop
||      LDW   .D1   *A4++,A2    ;***** ld ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;***** ld bi & bi+1 from memory

        MPY   .M1X  A2,B2,A6    ;* ai * bi
||      MPYH  .M2X  A2,B2,B6    ;* ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;***** decrement loop counter
|| [A1] B     .S2   LOOP        ;**** branch to loop
||      LDW   .D1   *A4++,A2    ;****** ld ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;****** ld bi & bi+1 from memory

LOOP:
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;****** decrement loop counter
|| [A1] B     .S2   LOOP        ;***** branch to loop
||      LDW   .D1   *A4++,A2    ;******* ld ai & ai+1 fm memory
||      LDW   .D2   *B4++,B2    ;******* ld bi & bi+1 fm memory
                                ; Branch occurs here

        ADD   .L1X  A7,B7,A4    ; sum = sum0 + sum1
4.3.5 Removing Extraneous Instructions


The code in Example 4-11 executes multiple iterations with the following
events occurring in parallel:


- Iteration 50 of the ADD instructions
- Iteration 52 of the MPY and MPYH instructions
- Iteration 57 of the LDW instructions


In most cases, extra iterations are not a problem, except that when extraneous
LDWs access unmapped memory, you can get unpredictable results. To remove
the extraneous LDW and MPY instructions, you can add an epilogue that
is included in the second part of Example 4-12.


- To eliminate LDWs from the iterations 51 through 57, run the loop seven
  fewer times.

- The loop counter is 43 (50 - 7), which means you still must execute seven
  more cycles of ADD instructions and five more cycles of MPY instructions.

- Five pairs of MPYs and seven pairs of ADDs are now outside the loop. The
  LDWs, MPYs, and ADDs all execute exactly 50 times.


Executing the dot-product code in Example 4-12 with no extraneous LDWs
still requires a total of 58 cycles (7 + 43 + 7 + 1).
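
The execution counts above can be verified with a little arithmetic. In the C sketch below, an instruction's offset is the cycle on which it is issued within its own iteration (0 for LDW, 5 for MPY and MPYH, 7 for ADD). The function names are illustrative, and the constants are taken from the seven prologue cycles and seven epilogue cycles of Example 4-12.

```c
enum { PROLOGUE_CYCLES = 7, EPILOGUE_CYCLES = 7 };

/* Copies of an instruction that appear before the loop. */
int prologue_copies(int offset)
{
    return PROLOGUE_CYCLES - offset;
}

/* Copies of an instruction that appear after the loop. */
int epilogue_copies(int offset)
{
    return offset;
}

/* Total number of times the instruction executes. */
int total_executions(int offset, int loop_count)
{
    return prologue_copies(offset) + loop_count + epilogue_copies(offset);
}

/* Prologue, loop, epilogue, and the final ADD that combines the sums. */
int pipelined_total_cycles(int loop_count)
{
    return PROLOGUE_CYCLES + loop_count + EPILOGUE_CYCLES + 1;
}
```

With a loop count of 43, each of LDW, MPY, and ADD executes 50 times, and the total is 58 cycles.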
Example 4-12. Software Pipelined Dot Product With No Extraneous Loads


        LDW   .D1   *A4++,A2    ; load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ; load bi & bi+1 from memory
||      MVK   .S1   43,A1       ; set up loop counter
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator

   [A1] SUB   .S1   A1,1,A1     ; decrement loop counter
||      LDW   .D1   *A4++,A2    ;* load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;* load bi & bi+1 from memory

   [A1] SUB   .S1   A1,1,A1     ;* decrement loop counter
|| [A1] B     .S2   LOOP        ; branch to loop
||      LDW   .D1   *A4++,A2    ;** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;** load bi & bi+1 from memory

   [A1] SUB   .S1   A1,1,A1     ;** decrement loop counter
|| [A1] B     .S2   LOOP        ;* branch to loop
||      LDW   .D1   *A4++,A2    ;*** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;*** load bi & bi+1 from memory

   [A1] SUB   .S1   A1,1,A1     ;*** decrement loop counter
|| [A1] B     .S2   LOOP        ;** branch to loop
||      LDW   .D1   *A4++,A2    ;**** load ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;**** load bi & bi+1 from memory

        MPY   .M1X  A2,B2,A6    ; ai * bi
||      MPYH  .M2X  A2,B2,B6    ; ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;**** decrement loop counter
|| [A1] B     .S2   LOOP        ;*** branch to loop
||      LDW   .D1   *A4++,A2    ;***** ld ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;***** ld bi & bi+1 from memory

        MPY   .M1X  A2,B2,A6    ;* ai * bi
||      MPYH  .M2X  A2,B2,B6    ;* ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;***** decrement loop counter
|| [A1] B     .S2   LOOP        ;**** branch to loop
||      LDW   .D1   *A4++,A2    ;****** ld ai & ai+1 from memory
||      LDW   .D2   *B4++,B2    ;****** ld bi & bi+1 from memory

LOOP:
        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1
|| [A1] SUB   .S1   A1,1,A1     ;****** decrement loop counter
|| [A1] B     .S2   LOOP        ;***** branch to loop
||      LDW   .D1   *A4++,A2    ;******* ld ai & ai+1 fm memory
||      LDW   .D2   *B4++,B2    ;******* ld bi & bi+1 fm memory
                                ; Branch occurs here

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6    ;** ai * bi
||      MPYH  .M2X  A2,B2,B6    ;** ai+1 * bi+1

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)

        ADD   .L1   A6,A7,A7    ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7    ; sum1 += (ai+1 * bi+1)

        ADD   .L1X  A7,B7,A4    ; sum = sum0 + sum1




4.3.6 Priming the Loop 
Although Example 4-12 executes as fast as possible, the code size could be 
smaller. To help reduce code size, you can use a technique called priming the 
loop. Assuming that you can handle extraneous LDWs, start with 
Example 4-11, which has no epilogue and, therefore, fewer instructions. (This 
technique could be used equally well on Example 4-12.) 


To eliminate the prologue and, therefore, the extra LDW and MPY instructions, 
begin execution at the loop body (at the LOOP label). Eliminating the prologue 
has several implications: 


❏ Two LDWs, two MPYs, and two ADDs occur in the first execution cycle of 
  the loop. 

❏ Since the first LDWs require five cycles to write results into a register, the 
  MPYs do not multiply valid data until after the loop executes five times. The 
  ADDs have no valid data until after seven cycles (five cycles for the first 
  LDWs and two more cycles for the first valid MPYs). 


Example 4-13 shows the loop without the prologue but with four new 
instructions that zero out the inputs to the MPY and ADD instructions. 


❏ Making the MPYs and ADDs use 0s before valid data is available ensures 
  that the final accumulate values are unaffected. 

❏ The loop counter is initialized to 57 (50 + 7) to accommodate the seven 
  extra cycles needed to prime the loop. 


Executing the dot-product code in Example 4-13 is less efficient than the code 
in Example 4-11. Because the first LDWs are not issued until after seven 
cycles, the code in Example 4-13 requires a total of 65 cycles (7 + 57 + 1). 
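
The effect of priming can be checked with a small behavioral model in C. The 
sketch below is not cycle-accurate and is not taken from the manual: the names 
primed_dot, Loaded, LOAD_DELAY, and MPY_DELAY are hypothetical, and the 
junk value 0x7FFF merely stands in for whatever an extraneous LDW would 
fetch. It assumes each trip through the loop issues one load, one multiply, 
and one add, with the five-cycle and two-cycle spacings described above. 

```c
#include <assert.h>
#include <stdint.h>

enum { LOAD_DELAY = 5, MPY_DELAY = 2, PRIME = LOAD_DELAY + MPY_DELAY };

/* One LDW result per array: two packed 16-bit elements. */
typedef struct { int16_t a_lo, a_hi, b_lo, b_hi; } Loaded;

int32_t primed_dot(const int16_t *a, const int16_t *b, int n_words)
{
    Loaded  in_flight[LOAD_DELAY] = {{0, 0, 0, 0}}; /* like ZERO A2 / ZERO B2 */
    int32_t prod0[MPY_DELAY] = {0};                 /* like ZERO A6 */
    int32_t prod1[MPY_DELAY] = {0};                 /* like ZERO B6 */
    int32_t sum0 = 0, sum1 = 0;                     /* like ZERO A7 / ZERO B7 */
    int     trips = n_words + PRIME;                /* 50 + 7 = 57 in the text */
    int     c;

    assert(n_words >= 0);

    for (c = 0; c < trips; c++) {
        Loaded  *ld = &in_flight[c % LOAD_DELAY];
        int32_t *p0 = &prod0[c % MPY_DELAY];
        int32_t *p1 = &prod1[c % MPY_DELAY];

        /* ADD: consumes the product issued MPY_DELAY trips ago (0 at first). */
        sum0 += *p0;
        sum1 += *p1;

        /* MPY / MPYH: consume the word loaded LOAD_DELAY trips ago (0 at first). */
        *p0 = (int32_t)ld->a_lo * ld->b_lo;
        *p1 = (int32_t)ld->a_hi * ld->b_hi;

        /* LDW: issue the next load; past the end of the data it is extraneous. */
        if (c < n_words) {
            ld->a_lo = a[2 * c];  ld->a_hi = a[2 * c + 1];
            ld->b_lo = b[2 * c];  ld->b_hi = b[2 * c + 1];
        } else {
            ld->a_lo = ld->a_hi = ld->b_lo = ld->b_hi = 0x7FFF; /* junk */
        }
    }
    return sum0 + sum1;
}
```

In this model the zeros consumed during the first seven trips leave the sums 
untouched, and the junk fetched during the last seven trips never reaches an 
ADD before the loop ends, which is why the result matches a plain dot product. 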
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Example 4—13. Software Pipelined Dot Product — No Prologue or Epilogue 


        MVK   .S1   57,A1          ; set up loop counter

 [A1]   ADD   .S1   -1,A1,A1       ; decrement loop counter
||      ZERO  .L1   A7             ; zero out sum0 accumulator
||      ZERO  .L2   B7             ; zero out sum1 accumulator

 [A1]   ADD   .S1   -1,A1,A1       ;* decrement loop counter
|| [A1] B     .S2   LOOP           ; branch to loop
||      ZERO  .L1   A6             ; zero out add input
||      ZERO  .L2   B6             ; zero out add input

 [A1]   ADD   .S1   -1,A1,A1       ;** decrement loop counter
|| [A1] B     .S2   LOOP           ;* branch to loop
||      ZERO  .L1   A2             ; zero out mpy input
||      ZERO  .L2   B2             ; zero out mpy input

 [A1]   ADD   .S1   -1,A1,A1       ;*** decrement loop counter
|| [A1] B     .S2   LOOP           ;** branch to loop

 [A1]   ADD   .S1   -1,A1,A1       ;**** decrement loop counter
|| [A1] B     .S2   LOOP           ;*** branch to loop

 [A1]   ADD   .S1   -1,A1,A1       ;***** decrement loop counter
|| [A1] B     .S2   LOOP           ;**** branch to loop

LOOP:
        ADD   .L1   A6,A7,A7       ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7       ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6       ;** ai * bi
||      MPYH  .M2X  A2,B2,B6       ;** ai+1 * bi+1
|| [A1] ADD   .S1   -1,A1,A1       ;****** decrement loop counter
|| [A1] B     .S2   LOOP           ;***** branch to loop
||      LDW   .D1   *A4++,A2       ;******* ld ai & ai+1 fm memory
||      LDW   .D2   *B4++,B2       ;******* ld bi & bi+1 fm memory

        ; Branch occurs here

        ADD   .L1X  A7,B7,A4       ; sum = sum0 + sum1
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4.3.7 Removing Extra SUB Instructions 


Because you know that the loop count is at least 6, you can eliminate the extra 
SUB instructions as shown in Example 4-14. 


❏ The first five branch instructions are made unconditional, since they 
  always execute. (If you do not know that the loop count is at least 6, you 
  must keep the SUB instructions that decrement before each conditional 
  branch as in Example 4-13.) 

❏ Based on the elimination of six SUB instructions, the loop counter is now 
  51 (57 - 6). 


This code shows some improvement over Example 4-13. The loop in 
Example 4-14 requires 63 cycles (5 + 57 + 1). 


Example 4-14. Software Pipelined Dot Product With Smallest Code Size 


        B     .S2   LOOP           ; branch to loop
||      MVK   .S1   51,A1          ; set up loop counter

        B     .S2   LOOP           ;* branch to loop

        B     .S2   LOOP           ;** branch to loop
||      ZERO  .L1   A7             ; zero out sum0 accumulator
||      ZERO  .L2   B7             ; zero out sum1 accumulator

        B     .S2   LOOP           ;*** branch to loop
||      ZERO  .L1   A6             ; zero out add input
||      ZERO  .L2   B6             ; zero out add input

        B     .S2   LOOP           ;**** branch to loop
||      ZERO  .L1   A2             ; zero out mpy input
||      ZERO  .L2   B2             ; zero out mpy input

LOOP:
        ADD   .L1   A6,A7,A7       ; sum0 += (ai * bi)
||      ADD   .L2   B6,B7,B7       ; sum1 += (ai+1 * bi+1)
||      MPY   .M1X  A2,B2,A6       ;** ai * bi
||      MPYH  .M2X  A2,B2,B6       ;** ai+1 * bi+1
|| [A1] ADD   .S1   -1,A1,A1       ;****** decrement loop counter
|| [A1] B     .S2   LOOP           ;***** branch to loop
||      LDW   .D1   *A4++,A2       ;******* ld ai & ai+1 fm memory
||      LDW   .D2   *B4++,B2       ;******* ld bi & bi+1 fm memory

        ; Branch occurs here

        ADD   .L1X  A7,B7,A4       ; sum = sum0 + sum1


Table 4—5 compares the performance of all versions of the dot-product code. 


Table 4—5. Comparison of Dot-Product Code Examples 


Code Example                                                           100 Iterations      Cycle Count

Example 4-4   Serial Assembly                                          2 + 100 x 16        1602
Example 4-5   Parallel Assembly                                        1 + 100 x 8         801
Example 4-9   Dot-Product Assembly With LDW                            1 + (50 x 8) + 1    402
Example 4-11  Software Pipelined Dot Product                           7 + 50 + 1          58
Example 4-12  Software Pipelined Dot Product With No Extraneous Loads  7 + 43 + 7 + 1      58
Example 4-13  Software Pipelined Dot Product - No Prologue or Epilogue 7 + 57 + 1          65
Example 4-14  Software Pipelined Dot Product With Smallest Code Size   5 + 57 + 1          63
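
The cycle counts in Table 4-5 all follow one pattern: cycles spent before the 
loop, plus loop trips times cycles per trip, plus cycles spent after the loop. 
A minimal sketch of that arithmetic follows; the function name total_cycles 
and its parameter names are hypothetical and are not part of the manual. 

```c
#include <assert.h>

/* Total cycles = setup cycles + (trips * cycles per trip) + trailing cycles. */
int total_cycles(int before, int trips, int cycles_per_trip, int after)
{
    assert(before >= 0 && trips >= 0 && cycles_per_trip > 0 && after >= 0);
    return before + trips * cycles_per_trip + after;
}
```

For instance, total_cycles(7, 57, 1, 1) corresponds to the 7 + 57 + 1 entry for 
Example 4-13, and total_cycles(5, 57, 1, 1) to the entry for Example 4-14. 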
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4.4 Modulo Scheduling of Multicycle Loops 


Section 4.3 demonstrated the modulo-scheduling technique for the dot- 
product code. In that example of a single-cycle loop, none of the instructions 
used the same resources. When you have multicycle loops, resource conflicts 
can affect modulo scheduling. 


4.4.1. Weighted Vector Sum C Code 


Example 4—15 shows the C code for a weighted vector sum. 


Example 4-15. Weighted Vector Sum C Code 


void w_vec(short a[],short b[],short c[],short m) 
{ 
    int i; 

    for (i=0; i<100; i++) { 
        c[i] = ((m * a[i]) >> 15) + b[i]; 
    } 
} 


4.4.2 Translating the Inner Loop to ’C62xx Instructions 


Example 4-16 lists the symbolic instructions that execute the inner loop of 
the weighted vector sum in Example 4-15. 


Example 4-16. List of Symbolic Weighted Vector Sum Instructions 


        LDH   *aptr++,ai             ; ai
        LDH   *bptr++,bi             ; bi
        MPY   m,ai,pi                ; m * ai
        SHR   pi,15,pi_scaled        ; (m * ai) >> 15
        ADD   pi_scaled,bi,ci        ; ci = (m * ai) >> 15 + bi
        STH   ci,*cptr++             ; store ci
[cntr]  SUB   cntr,1,cntr            ; decrement loop counter
[cntr]  B     LOOP                   ; branch to loop


4.4.3. Determining the Minimum Iteration Interval 


Example 4—16 includes three memory operations in the inner loop (two LDHs 
and the STH) that must each use a .D unit. Because only two .D units are avail- 
able on any single cycle, this loop requires at least two cycles. Since no other 
resources are used more than twice, the minimum iteration interval for this 
loop is 2. Because memory operations are determining the minimum iteration 
interval, unrolling the loop and performing LDWs can help improve the perfor- 
mance. 
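
The resource argument above amounts to taking, over every kind of functional 
unit, the ceiling of (instructions needing that unit) / (units available) and 
keeping the largest value. The following sketch illustrates this under a 
simplifying assumption: every instruction is tied to exactly one unit class, 
so instructions that could run on several unit types are not modeled. The 
function names cycles_needed and resource_min_ii are hypothetical. 

```c
#include <assert.h>

/* Smallest whole number of cycles that fits `uses` instructions onto
 * `units` identical functional units: ceil(uses / units). */
int cycles_needed(int uses, int units)
{
    assert(uses >= 0 && units > 0);
    return (uses + units - 1) / units;
}

/* Resource-limited minimum iteration interval: the worst case over all
 * unit classes, and never less than one cycle. */
int resource_min_ii(const int uses[], const int units[], int n_classes)
{
    int ii = 1;
    int k;

    for (k = 0; k < n_classes; k++) {
        int need = cycles_needed(uses[k], units[k]);
        if (need > ii)
            ii = need;
    }
    return ii;
}
```

With the three memory operations of Example 4-16 on two .D units, the .D 
class alone yields 2, which matches the minimum iteration interval above. 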




4.4.4 Unrolling the Weighted Vector Sum C Code 


Example 4—17 shows the C code for an unrolled version of the weighted vector 
sum. 


Example 4-17. Weighted Vector Sum C Code 


void w_vec(short a[],short b[],short c[],short m) 
{ 
    int i; 

    for (i=0; i<100; i+=2) { 
        c[i]   = ((m * a[i]) >> 15) + b[i]; 
        c[i+1] = ((m * a[i+1]) >> 15) + b[i+1]; 
    } 
} 


4.4.5 Translating Unrolled Inner Loop to ’C62xx Instructions 


Example 4-18 lists the weighted vector sum instructions that calculate c[i] 
and c[i+1] in Example 4-17. 


❏ The two store pointers (*ciptr and *ci+1ptr) are separated so that one 
  (*ciptr) increments by 2 through the even-indexed elements of the array 
  (c[0], c[2], ...) and the other (*ci+1ptr) increments by 2 through the 
  odd-indexed elements (c[1], c[3], ...). 

❏ AND and SHR separate bi and bi+1 into two separate registers. 

❏ This code assumes that mask is preloaded with 0x0000FFFF to clear the 
  upper 16 bits. The shift right of 16 places bi+1 into the 16 LSBs. 
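
The AND and SHR steps can be pictured in C as follows. This is only a sketch: 
the names Halves, pack_word, and split_word are hypothetical, the halves are 
kept as unsigned 16-bit values for simplicity, and the first element is 
assumed to sit in the 16 LSBs of the loaded word, as the bullets above 
describe. 

```c
#include <assert.h>
#include <stdint.h>

/* The two 16-bit elements carried by one 32-bit load. */
typedef struct { uint16_t lo, hi; } Halves;

/* Build a 32-bit word from two 16-bit elements (first element in the LSBs). */
uint32_t pack_word(uint16_t lo, uint16_t hi)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Separate the two elements again, mirroring AND with the mask and SHR by 16. */
Halves split_word(uint32_t word)
{
    Halves h;

    h.lo = (uint16_t)(word & 0x0000FFFFu);  /* AND  word,mask -> first element  */
    h.hi = (uint16_t)(word >> 16);          /* SHR  word,16   -> second element */

    assert(pack_word(h.lo, h.hi) == word);  /* round-trip sanity check */
    return h;
}
```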


Example 4—18. List of Symbolic Weighted Vector Sum Instructions Using LDW 


        LDW    *aptr++,ai_i+1            ; ai & ai+1
        LDW    *bptr++,bi_i+1            ; bi & bi+1
        MPY    ai_i+1,m,pi               ; m * ai
        MPYHL  ai_i+1,m,pi+1             ; m * ai+1
        SHR    pi,15,pi_scaled           ; (m * ai) >> 15
        SHR    pi+1,15,pi+1_scaled       ; (m * ai+1) >> 15
        AND    bi_i+1,mask,bi            ; bi
        SHR    bi_i+1,16,bi+1            ; bi+1
        ADD    pi_scaled,bi,ci           ; ci = (m * ai) >> 15 + bi
        ADD    pi+1_scaled,bi+1,ci+1     ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    ci,*ciptr++[2]            ; store ci
        STH    ci+1,*ci+1ptr++[2]        ; store ci+1
[cntr]  SUB    cntr,1,cntr               ; decrement loop counter
[cntr]  B      LOOP                      ; branch to loop
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4.4.6 Determining a New Minimum Iteration Interval 


Use the following considerations to determine the minimum iteration interval 
for the assembly instructions in Example 4-18. 


❏ Four memory operations (two LDWs and two STHs) must each use a .D 
  unit. With two .D units available, this loop still requires only two cycles. 

❏ Four instructions must use the .S units (three SHRs and one branch). With 
  two .S units available, the minimum iteration interval is still 2. 

❏ The two MPYs do not increase the minimum iteration interval. 

❏ Since the remaining four instructions (two ADDs, AND, and SUB) can all 
  go on a .L unit, the minimum iteration interval for this loop is the same as 
  in Example 4-16. 


By using LDWs instead of LDHs, the program can do twice as much work in 
the same number of cycles. 


4.4.7 Dependency Graph 


To achieve a minimum iteration interval of 2, you must put an equal number 
of operations per unit on each side of the dependency graph. Three operations 
in one unit on a side would result in a minimum iteration interval of 3. 


Figure 4-6 shows the dependency graph divided evenly with a minimum 
iteration interval of 2. 


Figure 4-6. Dependency Graph of Weighted Vector Sum 


[Figure not reproduced: dependency graph divided into an A side and a B side, 
starting from the LDW of ai & ai+1.] 
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4.4.8 Allocating Resources 


Using the dependency graph, you can allocate functional units and registers 
as shown in Example 4-19. This code is based on the following assumptions: 


❏ The pointers are initialized outside the loop. 
❏ m resides in B6, which causes both .M units to use a cross path. 
❏ The mask in the AND instruction resides in B10. 


Example 4—19. List of Actual Weighted Vector Sum Instructions 


        LDW    .D1   *A4++,A2        ; ai & ai+1
        LDW    .D2   *B4++,B2        ; bi & bi+1
        MPY    .M1X  A2,B6,A5        ; m * ai
        MPYHL  .M2X  A2,B6,B5        ; m * ai+1
        SHR    .S1   A5,15,A7        ; (m * ai) >> 15
        SHR    .S2   B5,15,B7        ; (m * ai+1) >> 15
        AND    .L2   B2,B10,B8       ; bi
        SHR    .S2   B2,16,B1        ; bi+1
        ADD    .L1X  A7,B8,A9        ; ci = (m * ai) >> 15 + bi
        ADD    .L2   B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    .D1   A9,*A6++[2]     ; store ci
        STH    .D2   B9,*B0++[2]     ; store ci+1
[A1]    SUB    .L1   A1,1,A1         ; decrement loop counter
[A1]    B      .S1   LOOP            ; branch to loop
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4.4.9 Modulo Iteration Interval Scheduling 


Table 4-6 provides a method to keep track of resources that are a modulo 
iteration interval away from each other. In the single-cycle dot-product 
example, every instruction executed every cycle and, therefore, required only 
one set of resources. Table 4-6 includes two groups of resources, which are 
necessary because you are scheduling a two-cycle loop. 


❏ Instructions that execute on cycle k also execute on cycle k + 2, k + 4, etc. 
  Instructions scheduled on these even cycles cannot use the same 
  resources. 

❏ Instructions that execute on cycle k + 1 also execute on cycle k + 3, 
  k + 5, etc. Instructions scheduled on these odd cycles cannot use the same 
  resources. 

❏ Because two instructions (MPY and ADD) use the 1X path but do not use 
  the same functional unit, Table 4-6 includes two rows (1X and 2X) that 
  help you keep track of the cross path resources. 


Only seven instructions have been scheduled in this table. 


❏ The two LDWs use the .D1 and .D2 units on the even cycles. 

❏ The MPY and MPYHL are scheduled on cycle 5 because the LDW has four 
  delay slots. The MPY instructions appear in two rows because they use 
  the .M and cross path resources on cycles 5, 7, 9, etc. 

❏ The two SHR instructions are scheduled two cycles after the MPY to allow 
  for the MPY's single delay slot. 

❏ The AND is scheduled on cycle 5, four delay slots after the LDW. 
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Table 4-6. Weighted Vector Sum Modulo Iteration Interval Table (2-Cycle loop) 


Unit   Even cycles (0, 2, 4, ...)     Odd cycles (1, 3, 5, ...)

.D1    LDW ai_i+1  (0)
.D2    LDW bi_i+1  (0)
.M1                                   MPY pi      (5)
.M2                                   MPYHL pi+1  (5)
.L1
.L2                                   AND bi      (5)
.S1                                   SHR pi_s    (7)
.S2                                   SHR pi+1_s  (7)
1X                                    MPY pi      (5)
2X                                    MPYHL pi+1  (5)

Note: The number in parentheses is the cycle on which the first iteration 
issues the instruction; later iterations repeat it every two cycles. 
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4.4.10 Resource Conflicts 


Resources from one instruction cannot conflict with resources from any other 
instruction scheduled modulo iteration intervals away. In other words, for a 
two-cycle loop, instructions scheduled on cycle n cannot use the same 
resources as instructions scheduled on cycles n + 2, n + 4, n + 6, etc. 
Table 4-7 shows the addition of the SHR bi+1 instruction. This must avoid a 
conflict of resources in cycles 5 and 7, which are one iteration interval away 
from each other. 


Even though LDW bi_i+1 (.D2, cycle 0) finishes on cycle 5, its child, SHR bi+1, 
cannot be scheduled on .S2 until cycle 6 because of a resource conflict with 
SHR pi+1_s, which is on .S2 in cycle 7. 
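
The conflict rule reduces to a congruence test: two instructions collide when 
they use the same resource and their cycles are equal modulo the iteration 
interval. A minimal sketch follows; the function name modulo_conflict and the 
use of unit-name strings are illustrative assumptions, not part of any tool. 

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if two instructions collide in a software-pipelined loop with
 * iteration interval ii: same unit, and cycles congruent modulo ii. */
int modulo_conflict(const char *unit_a, int cycle_a,
                    const char *unit_b, int cycle_b, int ii)
{
    assert(ii > 0 && cycle_a >= 0 && cycle_b >= 0);

    if (strcmp(unit_a, unit_b) != 0)
        return 0;                        /* different units never collide */
    return (cycle_a % ii) == (cycle_b % ii);
}
```

For the case above, placing SHR bi+1 on .S2 in cycle 5 collides with 
SHR pi+1_s on .S2 in cycle 7 (both odd), while cycle 6 does not. 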
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Table 4—7. Modulo Iteration Interval Table With SHR Instructions 


Unit   Even cycles (0, 2, 4, ...)     Odd cycles (1, 3, 5, ...)

.D1    LDW ai_i+1  (0)
.D2    LDW bi_i+1  (0)
.M1                                   MPY pi      (5)
.M2                                   MPYHL pi+1  (5)
.L1
.L2                                   AND bi      (5)
.S1                                   SHR pi_s    (7)
.S2    SHR bi+1    (6)                SHR pi+1_s  (7)
1X                                    MPY pi      (5)
2X                                    MPYHL pi+1  (5)

Note: The number in parentheses is the cycle on which the first iteration 
issues the instruction; later iterations repeat it every two cycles. 
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4.4.11 Live Too Long 


No value can be live in a register for more than the number of cycles in the loop. 
Otherwise, iteration n + 1 writes into the register before iteration n has read that 
register. Therefore, if in a two-cycle loop, a value is written to a register at the 
end of cycle n, then all children of that value must read the register before the 
end of cycle n + 2. 


The ADD ci in Figure 4-6 demonstrates a live-too-long problem. The parents 
of ADD ci (AND bi and SHR pi_s) are scheduled on cycles 5 and 7, 
respectively. 


❏ Since the SHR pi_s is scheduled on cycle 7, the earliest you can schedule 
  ADD ci is cycle 8. 

❏ In cycle 7, AND bi* writes bi for the next iteration of the loop, which means 
  if you schedule ADD ci on cycle 8, it reads the parent value of bi for the next 
  iteration, which is incorrect. This situation illustrates the live-too-long 
  problem. 


4.4.12 Solving the Live-Too-Long Problem 


The live-too-long problem in Table 4-7 means that the bi value would have to 
be live from cycles 6-8, or 3 cycles. No loop variable can live longer than the 
iteration interval, because a child would then read the parent value for the next 
iteration. 


To solve this problem, Figure 4-7 and Table 4-8 show that AND bi has been 
moved to cycle 6 so that you can schedule ADD ci to read the correct value 
on cycle 8. 
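
The live-too-long test can be written as a one-line check. The sketch below 
uses the convention from Section 4.4.11: a value produced by an instruction 
issued on cycle w is written at the end of cycle w, so a child issued on 
cycle r sees it only if r - w does not exceed the iteration interval. The 
function name lives_too_long is hypothetical. 

```c
#include <assert.h>

/* Returns 1 if a value written at the end of cycle `write_cycle` and read
 * by a child issued on cycle `read_cycle` outlives an iteration interval
 * of `ii` cycles (the next iteration would overwrite it first). */
int lives_too_long(int write_cycle, int read_cycle, int ii)
{
    assert(ii > 0 && read_cycle > write_cycle);
    return (read_cycle - write_cycle) > ii;
}
```

With AND bi on cycle 5 and ADD ci on cycle 8 the check fails for a two-cycle 
loop; moving AND bi to cycle 6 brings the lifetime down to two cycles. 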
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Figure 4—7. Dependency Graph of Weighted Vector Sum 
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Table 4—8. Weighted Vector Sum Modulo Iteration Interval Table (2-Cycle loop) 


Unit   Even cycles (0, 2, 4, ...)     Odd cycles (1, 3, 5, ...)

.D1    LDW ai_i+1  (0)
.D2    LDW bi_i+1  (0)
.M1                                   MPY pi      (5)
.M2                                   MPYHL pi+1  (5)
.L1    ADD ci      (8)
.L2    AND bi      (6)
.S1                                   SHR pi_s    (7)
.S2    SHR bi+1    (6)                SHR pi+1_s  (7)
1X     ADD ci      (8)                MPY pi      (5)
2X                                    MPYHL pi+1  (5)

Note: The number in parentheses is the cycle on which the first iteration 
issues the instruction; later iterations repeat it every two cycles. Compared 
with Table 4-7, AND bi has moved from cycle 5 to cycle 6 and ADD ci is 
scheduled on cycle 8. 
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4.4.13 Scheduling the Remaining Instructions 
Figure 4-8 shows the dependency graph with additional changes. The final 
version of the loop, with all instructions scheduled correctly, is shown in 
Table 4-9. 


❏ Table 4-9 shows the following additions: 

  ■ B LOOP (.S1, cycle 6) 
  ■ SUB (.L1, cycle 5) 
  ■ ADD ci+1 (.L2, cycle 10) 
  ■ STH ci (cycle 9) 
  ■ STH ci+1 (cycle 11) 


❏ To avoid resource conflicts and live-too-long problems, Table 4-9 also 
  includes the following additional changes: 

  ■ LDW bi_i+1 (.D2) moved from cycle 0 to cycle 2. 
  ■ AND bi (.L2) moved from cycle 6 to cycle 7. 
  ■ SHR pi+1_s (.S2) moved from cycle 7 to cycle 9. 
  ■ MPYHL moved from cycle 5 to cycle 6. 
  ■ SHR bi+1 moved from cycle 6 to cycle 8. 


From the table, you can see that this loop is pipelined six iterations deep, since 
iterations n and n + 5 execute in parallel. 
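
The pipeline depth follows directly from the schedule: if the last instruction 
of one iteration issues on cycle L (counting from 0) and a new iteration 
starts every ii cycles, then L / ii + 1 iterations overlap. The sketch below 
states this, together with the resulting number of cycles that run before the 
steady-state loop body is reached. Both function names are hypothetical. 

```c
#include <assert.h>

/* Number of loop iterations that overlap in the steady state, given the
 * cycle of the last instruction in one iteration and the iteration interval. */
int iterations_in_flight(int last_cycle, int ii)
{
    assert(last_cycle >= 0 && ii > 0);
    return last_cycle / ii + 1;
}

/* Cycles needed to fill the pipeline before the steady-state loop body. */
int prologue_cycles(int last_cycle, int ii)
{
    return (iterations_in_flight(last_cycle, ii) - 1) * ii;
}
```

For the weighted vector sum, the last instruction (STH ci+1) issues on 
cycle 11 of a two-cycle loop, so iterations_in_flight(11, 2) gives the six 
iterations noted above. 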


Figure 4-8. Dependency Graph for Scheduling ci+1 (Weighted Vector Sum) 


[Figure not reproduced: dependency graph divided into an A side and a B side, 
starting from the LDW on the A side.] 


Table 4-9. Weighted Vector Sum Modulo Iteration Interval Table (2-Cycle loop) 


Unit   Even cycles (0, 2, 4, ...)     Odd cycles (1, 3, 5, ...)

.D1    LDW ai_i+1  (0)                STH ci      (9)
.D2    LDW bi_i+1  (2)                STH ci+1    (11)
.M1                                   MPY pi      (5)
.M2    MPYHL pi+1  (6)
.L1    ADD ci      (8)                SUB         (5)
.L2    ADD ci+1    (10)               AND bi      (7)
.S1    B LOOP      (6)                SHR pi_s    (7)
.S2    SHR bi+1    (8)                SHR pi+1_s  (9)
1X     ADD ci      (8)                MPY pi      (5)
2X     MPYHL pi+1  (6)

Note: The number in parentheses is the cycle on which the first iteration 
issues the instruction; later iterations repeat it every two cycles. 
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4.4.14 Final Assembly 
Example 4-20 lists the final assembly for the weighted vector sum. 


❏ While iteration n of instruction STH ci+1 is executing, iteration n + 1 of STH 
  ci is executing. To prevent the STH ci instruction from executing iteration 
  51 while STH ci+1 executes iteration 50, schedule the final executions 
  of ADD ci+1 and STH ci+1 after exiting the loop and execute the loop only 
  49 times. 

❏ The mask for the AND instruction is created with MVK and MVKH in 
  parallel with the loop prologue. 

❏ The pointer to the odd elements in array c is also set up in parallel with the 
  loop prologue. 
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Example 4-20. Weighted Vector Sum 


        LDW    .D1   *A4++,A2        ; ai & ai+1

        ADD    .L2X  A6,2,B0         ; set pointer to ci+1

        LDW    .D2   *B4++,B2        ; bi & bi+1
||      LDW    .D1   *A4++,A2        ;* ai & ai+1

        MVK    .S2   -1,B10          ; set to all 1s (0xFFFFFFFF)

        LDW    .D2   *B4++,B2        ;* bi & bi+1
||      LDW    .D1   *A4++,A2        ;** ai & ai+1
||      MVK    .S1   49,A1           ; set up loop counter
||      MVKH   .S2   0,B10           ; clr upper 16 bits (0x0000FFFF)

        MPY    .M1X  A2,B6,A5        ; m * ai
|| [A1] SUB    .L1   A1,1,A1         ; decrement loop counter

        MPYHL  .M2X  A2,B6,B5        ; m * ai+1
|| [A1] B      .S1   LOOP            ; branch to loop
||      LDW    .D2   *B4++,B2        ;** bi & bi+1
||      LDW    .D1   *A4++,A2        ;*** ai & ai+1

        SHR    .S1   A5,15,A7        ; (m * ai) >> 15
||      AND    .L2   B2,B10,B8       ; bi
||      MPY    .M1X  A2,B6,A5        ;* m * ai
|| [A1] SUB    .L1   A1,1,A1         ;* decrement loop counter

        SHR    .S2   B2,16,B1        ; bi+1
||      ADD    .L1X  A7,B8,A9        ; ci = (m * ai) >> 15 + bi
||      MPYHL  .M2X  A2,B6,B5        ;* m * ai+1
|| [A1] B      .S1   LOOP            ;* branch to loop
||      LDW    .D2   *B4++,B2        ;*** bi & bi+1
||      LDW    .D1   *A4++,A2        ;**** ai & ai+1

        SHR    .S2   B5,15,B7        ; (m * ai+1) >> 15
||      STH    .D1   A9,*A6++[2]     ; store ci
||      SHR    .S1   A5,15,A7        ;* (m * ai) >> 15
||      AND    .L2   B2,B10,B8       ;* bi
|| [A1] SUB    .L1   A1,1,A1         ;** decrement loop counter
||      MPY    .M1X  A2,B6,A5        ;** m * ai

LOOP:
        ADD    .L2   B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1
||      SHR    .S2   B2,16,B1        ;* bi+1
||      ADD    .L1X  A7,B8,A9        ;* ci = (m * ai) >> 15 + bi
||      MPYHL  .M2X  A2,B6,B5        ;** m * ai+1
|| [A1] B      .S1   LOOP            ;** branch to loop
||      LDW    .D2   *B4++,B2        ;**** bi & bi+1
||      LDW    .D1   *A4++,A2        ;***** ai & ai+1

        STH    .D2   B9,*B0++[2]     ; store ci+1
||      SHR    .S2   B5,15,B7        ;* (m * ai+1) >> 15
||      STH    .D1   A9,*A6++[2]     ;* store ci
||      SHR    .S1   A5,15,A7        ;** (m * ai) >> 15
||      AND    .L2   B2,B10,B8       ;** bi
|| [A1] SUB    .L1   A1,1,A1         ;*** decrement loop counter
||      MPY    .M1X  A2,B6,A5        ;*** m * ai

        ; Branch occurs here

        ADD    .L2   B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1

        STH    .D2   B9,*B0          ; store ci+1
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4.5 Loop Carry Paths 


Loop carry paths occur when one iteration of a loop writes a value that must 
be read by a future iteration. A loop carry path can affect the performance of 
a software pipelined loop that executes multiple iterations in parallel. Some- 
times loop carry paths (instead of resources) determine the minimum iteration 
interval. 


IIR filter code contains a loop carry path, where output samples are used as 
input to the computation of the next output sample. 

4.5.1 IIR Filter C Code 

Example 4-21 shows C code for a simple IIR filter. 


❏ y[i] is an input to the calculation of y[i+1]. 

❏ Before y[i] can be read for the next iteration, y[i+1] must be computed 
  from the previous iteration. 


Example 4-21. IIR Filter C Code 


void iir(short x[],short y[],short c1, short c2, short c3) 
{ 
    int i; 

    for (i=0; i<100; i++) { 
        y[i+1] = (c1*x[i] + c2*x[i+1] + c3*y[i]) >> 15; 
    } 
} 




4.5.2 Symbolic ’C62xx Instructions (Inner Loop) 


Example 4-22 shows symbolic ’C62xx instructions that execute the inner 
loop. 


❏ xptr is not postincremented after loading xi+1, since xi of the next iteration 
  is actually xi+1 of the current iteration. Thus, the pointer points to the same 
  address when loading both xi+1 for one iteration and xi for the next 
  iteration. 

❏ yptr is also not postincremented after storing yi+1, since yi of the next 
  iteration is yi+1 for the current iteration. 


Example 4-22. List of Symbolic IIR Instructions 


        LDH   *xptr++,xi           ; xi
        MPY   c1,xi,p0             ; c1 * xi
        LDH   *xptr,xi+1           ; xi+1
        MPY   c2,xi+1,p1           ; c2 * xi+1
        ADD   p0,p1,s0             ; c1 * xi + c2 * xi+1
        LDH   *yptr++,yi           ; yi
        MPY   c3,yi,p2             ; c3 * yi
        ADD   s0,p2,s1             ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR   s1,15,yi+1           ; yi+1
        STH   yi+1,*yptr           ; store yi+1
[cntr]  SUB   cntr,1,cntr          ; decrement loop counter
[cntr]  B     LOOP                 ; branch to loop
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4.5.3 Dependency Graph 


Figure 4-9 shows the dependency graph for the IIR filter. 


❏ A loop carry path exists from the store of yi+1 to the load of yi. 

❏ The path between the STH and the LDH is one cycle because the load and 
  store instructions use the same memory pipeline. Therefore, if a store is 
  issued to a particular address on cycle n and a load from that same 
  address is issued on the next cycle, the load reads the value that was written 
  by the store instruction. 


Figure 4-9. Dependency Graph of IIR Filter 

[Figure not reproduced: dependency graph divided into an A side and a B side.] 


4.5.4 Minimum Iteration Interval 


To determine the minimum iteration interval, you must consider both resources 
and data dependency constraints. 


❏ Based on resources in Table 4-10, the minimum iteration interval is 2 
  because five instructions on the A side need a non-.M unit (.L1, .S1, 
  or .D1), which takes at least two cycles on three units, and no single unit 
  is used more than twice. 

❏ The IIR has a data dependency constraint defined by its loop carry path. 
  Figure 4-9 shows that if you schedule LDH yi on cycle 0: 

  ■ The earliest you can schedule MPY p2 is on cycle 5. 
  ■ The earliest you can schedule ADD s1 is on cycle 7. 
  ■ SHR must be on cycle 8 and STH on cycle 9. 
  ■ Because the LDH must wait for the STH to be issued, the earliest 
    the second iteration can begin is cycle 10. 

  To determine the minimum loop carry path, add all of the numbers along 
  the loop paths in the dependency graph. This means that this loop carry 
  path is 10 (5 + 2 + 1 + 1 + 1). 


Although the minimum iteration interval is the greater of the resource limits and 
data dependency constraints, an interval of 10 seems slow. Figure 4-10 
shows how to improve the performance. 
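
The two constraints combine as a simple maximum. In the sketch below, the loop 
carry bound is the sum of the delays along the carry path and the minimum 
iteration interval is the larger of that sum and the resource bound. The 
function names loop_carry_bound and min_iteration_interval are 
hypothetical. 

```c
#include <assert.h>

/* Sum of the delays (in cycles) along one loop carry path. */
int loop_carry_bound(const int delays[], int n)
{
    int total = 0;
    int k;

    assert(n >= 0);
    for (k = 0; k < n; k++)
        total += delays[k];
    return total;
}

/* The minimum iteration interval is the greater of the two constraints. */
int min_iteration_interval(int resource_bound, int carry_bound)
{
    return (resource_bound > carry_bound) ? resource_bound : carry_bound;
}
```

With the path delays 5, 2, 1, 1, and 1 and a resource bound of 2, the 
result is 10, as derived above. 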


Table 4-10. Resource Table For IIR Filter 


(a) A side 

Unit(s)              Instructions    Total/Unit 
.M1                  2 MPYs          2 
.S1                  B               1 
.D1                  2 LDHs          2 
.L1, .S1, or .D1     ADD & SUB       2 
Total non-.M units                   5 

(b) B side 

Unit(s)              Instructions    Total/Unit 
.M2                  MPY             1 
.S2                  SHR             1 
.D2                  STH             1 
.L2, .S2, or .D2     ADD             1 
Total non-.M units                   3 
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4.5.5 New Dependency Graph 
Figure 4-10 shows a new graph with a loop carry path of 4 (2 + 1 + 1). 


❏ Since the MPY p2 instruction can read yi+1 while it is still in a register, you 
  can reduce the loop carry path by six cycles. 

❏ With the LDH of yi no longer in the graph, you can issue the LDH of y[0] 
  once outside the loop. Every iteration after that, the yi+1 values written by 
  the SHR instruction are valid y inputs to the MPY instruction. 
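
The same idea can be expressed at the C level by carrying the previous output 
in a local variable instead of reloading it from the y array. The following is 
a sketch only: the name iir_carry and the extra length parameter n are 
assumptions added for illustration, and like the original C code it relies on 
the intermediate sum being nonnegative or on the compiler using an arithmetic 
right shift. 

```c
#include <assert.h>

/* IIR filter with the loop-carried value held in a local variable.
 * x must hold n + 1 samples and y must have room for n + 1 samples. */
void iir_carry(const short x[], short y[], short c1, short c2, short c3, int n)
{
    short yi;
    int   i;

    assert(n >= 0);

    yi = y[0];                            /* the single load of y[0] */
    for (i = 0; i < n; i++) {
        int s1 = c1 * x[i] + c2 * x[i + 1] + c3 * yi;

        yi = (short)(s1 >> 15);           /* yi+1 stays in a "register" */
        y[i + 1] = yi;                    /* store yi+1; it is never reloaded */
    }
}
```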


Figure 4-10. Dependency Graph of IIR Filter With Smaller Loop Carry 

[Figure not reproduced: dependency graph divided into an A side and a B side. 
Loop carry path: 2 + 1 + 1 = 4.] 
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4.5.6 New Symbolic ’C62xx Instructions (Inner Loop) 


In Example 4-23, you no longer have LDH yi. The one variable y that is read 
and written is yi for the MPY p2 instruction and yi+1 for the SHR and STH 
instructions. 


Example 4-23. List of Symbolic IIR Instructions With Reduced Loop Carry Path 


        LDH   *xptr++,xi           ; xi
        MPY   c1,xi,p0             ; c1 * xi
        LDH   *xptr,xi+1           ; xi+1
        MPY   c2,xi+1,p1           ; c2 * xi+1
        ADD   p0,p1,s0             ; c1 * xi + c2 * xi+1
        MPY   c3,yi,p2             ; c3 * yi
        ADD   s0,p2,s1             ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR   s1,15,yi             ; yi+1
        STH   yi,*yptr++           ; store yi+1
[cntr]  SUB   cntr,1,cntr          ; decrement loop counter
[cntr]  B     LOOP                 ; branch to loop


4.5.7 Allocating Resources 


Example 4-24 lists the ’C62xx instructions with the functional units and 
registers that are used in the inner loop. 


Example 4-24. List of Actual IIR Instructions 


        LDH   .D1   *A4++,A2       ; xi
        MPY   .M1   A6,A2,A5       ; c1 * xi
        LDH   .D1   *A4,A3         ; xi+1
        MPY   .M1X  B6,A3,A7       ; c2 * xi+1
        ADD   .L1   A5,A7,A9       ; c1 * xi + c2 * xi+1
        MPY   .M2X  A8,B2,B3       ; c3 * yi
        ADD   .L2X  B3,A9,B5       ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR   .S2   B5,15,B2       ; yi+1
        STH   .D2   B2,*B4++       ; store yi+1
[A1]    SUB   .L1   A1,1,A1        ; decrement loop counter
[A1]    B     .S1   LOOP           ; branch to loop
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4.5.8 Modulo Iteration Interval Scheduling 
Table 4-11 shows the modulo iteration interval table for the IIR. The SHR on 
cycle 10 finishes in time for the MPY p2 instruction from the next iteration to 
read its result on cycle 11. 


Table 4-11. IIR Modulo Iteration Interval Table (4-Cycle loop) 


Unit   Cycles 0, 4, 8, ...   Cycles 1, 5, 9, ...   Cycles 2, 6, 10, ...   Cycles 3, 7, 11, ...

.D1    LDH xi    (0)         LDH xi+1  (1)
.D2                                                                       STH yi+1  (11)
.M1                          MPY p0    (5)         MPY p1    (6)
.M2                                                                       MPY p2    (7)
.L1    ADD s0    (8)         SUB cntr  (5)
.L2                          ADD s1    (9)
.S1                                                B LOOP    (6)
.S2                                                SHR yi+1  (10)
1X                                                 MPY p1    (6)
2X                           ADD s1    (9)                                MPY p2    (7)

Note: The number in parentheses is the cycle on which the first iteration 
issues the instruction; later iterations repeat it every four cycles. 




4.5.9 Final Assembly 


Example 4-25 shows the final assembly for the IIR filter. With one load of y[0] 
outside the loop, no other loads from the y array are needed. Example 4-25 
requires 408 cycles: (4 x 100) + 8. 


Example 4-25. IIR Filter 


        LDH   .D1   *A4++,A2       ; xi

        LDH   .D1   *A4,A3         ; xi+1

        LDH   .D2   *B4++,B2       ; load y[0] outside of loop

        MVK   .S1   100,A1         ; set up loop counter

        LDH   .D1   *A4++,A2       ;* xi

 [A1]   SUB   .L1   A1,1,A1        ; decrement loop counter
||      MPY   .M1   A6,A2,A5       ; c1 * xi
||      LDH   .D1   *A4,A3         ;* xi+1

        MPY   .M1X  B6,A3,A7       ; c2 * xi+1
|| [A1] B     .S1   LOOP           ; branch to loop

        MPY   .M2X  A8,B2,B3       ; c3 * yi

LOOP:
        ADD   .L1   A5,A7,A9       ; c1 * xi + c2 * xi+1
||      LDH   .D1   *A4++,A2       ;** xi

        ADD   .L2X  B3,A9,B5       ; c1 * xi + c2 * xi+1 + c3 * yi
|| [A1] SUB   .L1   A1,1,A1        ;* decrement loop counter
||      MPY   .M1   A6,A2,A5       ;* c1 * xi
||      LDH   .D1   *A4,A3         ;** xi+1

        SHR   .S2   B5,15,B2       ; yi+1
||      MPY   .M1X  B6,A3,A7       ;* c2 * xi+1
|| [A1] B     .S1   LOOP           ;* branch to loop

        STH   .D2   B2,*B4++       ; store yi+1
||      MPY   .M2X  A8,B2,B3       ;* c3 * yi

        ; Branch occurs here
4.6 If-Then-Else Statements in a Loop 

If-then-else statements in C cause certain instructions to execute when the if 
condition is true or other instructions to execute when it is false. One way to 
accomplish this on the ’C62xx is with conditional instructions. Since all ’C62xx 
instructions can be conditional on one of five general-purpose registers, 
conditional instructions can handle both the true and false cases of the 
if-then-else C statement. 


4.6.1 If-Then-Else C Code 


Example 4-26 contains a loop with an if-then-else statement. You either add 
a[i] to sum or subtract a[i] from sum. 


Example 4-26. If-Then-Else C Code 


int if_then(short a[], int codeword, int mask, short theta) 
{ 
    int i, sum, cond; 

    sum = 0; 
    for (i = 0; i < 32; i++) { 
        cond = codeword & mask; 
        if (theta == !(!(cond))) 
            sum += a[i]; 
        else 
            sum -= a[i]; 
        mask = mask << 1; 
    } 
    return (sum); 
} 


4.6.2 Branching vs. Conditional Instructions 

Branching is one way to execute the if-then-else statement: 

❏ Branch to the ADD when the if statement is true 
❏ Branch to the SUB when the if statement is false 


Because each branch has five delay slots, this method requires additional 
cycles, and branching within the loop makes software pipelining almost 
impossible. 


Conditionals avoid having to branch to the appropriate piece of code after 
checking whether the condition is true or false. 


❏ Program both the ADD and SUB as usual but make them conditional on 
  the zero and nonzero values of a condition register. 

❏ This method also allows you to software pipeline the loop and achieve 
  much better performance than you would with branching. 
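
The conditional style can be mirrored in C by computing both candidate 
results and letting a predicate select one. This is a sketch for illustration 
only: the name if_then_predicated and the length parameter n are 
hypothetical, codeword and mask are made unsigned here (an assumption, so 
that shifting the mask is well defined), and whether a compiler turns the 
selection into conditional instructions depends on the compiler and its 
options. 

```c
#include <assert.h>

/* If-then-else loop written so that both outcomes are formed every
 * iteration and a predicate picks which one is kept. */
int if_then_predicated(const short a[], unsigned codeword, unsigned mask,
                       short theta, int n)
{
    int sum = 0;
    int i;

    assert(n >= 0 && n <= 32);

    for (i = 0; i < n; i++) {
        int cond   = (codeword & mask) != 0;  /* same value as !(!(cond))  */
        int pred   = (theta == cond);         /* plays the role of "if"    */
        int added  = sum + a[i];              /* then-case result          */
        int subbed = sum - a[i];              /* else-case result          */

        sum  = pred ? added : subbed;         /* predicate selects a result */
        mask = mask << 1;
    }
    return sum;
}
```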
4.6.3 ’C62xx Instructions (Inner Loop) 


Example 4-27 lists the ’C62xx instructions needed to execute the C code in 
Example 4-26. 


❏ If the result of the bitwise AND is nonzero, a 1 is written into cond. 

❏ A conditional MVK performs the !(!(cond)) C statement. 

  ■ If the result of the AND is 0, cond remains at 0. 
  ■ If the result of the AND is nonzero, cond is changed to 1. 

❏ CMPEQ is used to create if. 

❏ The ADD is conditional on if being nonzero (corresponds to then). 

❏ The SUB is conditional on if being 0 (corresponds to else). 


Example 4-27. List of Symbolic If-Then-Else Instructions 


        AND    codeword,mask,cond   ; cond = codeword & mask
[cond]  MVK    1,cond               ; !(!(cond))
        CMPEQ  theta,cond,if        ; (theta == !(!(cond)))
        LDH    *aptr++,ai           ; a[i]
[if]    ADD    sum,ai,sum           ; sum += a[i]
[!if]   SUB    sum,ai,sum           ; sum -= a[i]
        SHL    mask,1,mask          ; mask = mask << 1;
[cntr]  ADD    -1,cntr,cntr         ; decrement counter
[cntr]  B      LOOP                 ; for LOOP
4.6.4 Dependency Graph


Figure 4-11 shows the dependency graph for the if-then-else C code.


- Two nodes on the graph contain sum: one for the ADD and one for the
  SUB. Because some iterations are performing an ADD and others are
  performing a SUB, each of these nodes is a possible input to the next
  iteration of either node.
- The LDH ai instruction is a parent of both ADD sum and SUB sum, since
  both instructions read ai.
- CMPEQ if is also a parent to ADD sum and SUB sum, since both read if
  for the conditional execution.
- The result of SHL mask is read on the next iteration by the AND cond
  instruction.

Figure 4-11. Dependency Graph of If-Then-Else Code

[Figure not reproduced: the graph is split into an A side, which includes
SHL, and a B side, which includes AND.]
4.6.5 Minimum Iteration Interval


With nine instructions, the minimum iteration interval is at least 2, since a
maximum of eight instructions can be in parallel. Based on the way the
dependency graph in Figure 4-11 is split, five instructions are on the A side
and four are on the B side. Because none of the instructions are MPYs, all
instructions must go on the .S, .D, or .L units, which means you have a total
of six resources.


- LDH must be on a .D unit.
- SHL, B, and MVK must be on a .S unit.
- The ADDs and SUB can be on the .S, .L, or .D units.
- The AND can be on a .S or .L unit.


From Table 4-12, you can see that no one resource is used more than two
times so that the minimum iteration interval is still 2.


The minimum iteration interval is also affected by the total number of
instructions. Since three units can perform nonmultiply operations on a given
side, a total of five instructions can be performed with a minimum iteration
interval of 2. Since only four instructions are on the B side, the minimum
iteration interval is still 2.
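
The three resource checks described above can be combined into one small calculation. The following C sketch is an illustration only; the function name and its parameters are hypothetical, and it covers only the resource limits named in this section (eight instructions per cycle overall, three non-.M units per side, and the most heavily used single unit). It does not account for data dependencies.

```c
/* Ceiling division for positive integers. */
static int ceil_div(int n, int d)
{
    return (n + d - 1) / d;
}

static int max2(int x, int y)
{
    return x > y ? x : y;
}

/*
 * Resource-based lower bound on the iteration interval.
 *   total_instr   instructions in the loop body
 *   max_per_unit  largest Total/Unit entry that is tied to a single unit
 *   non_m_a       non-.M instructions assigned to the A side
 *   non_m_b       non-.M instructions assigned to the B side
 */
int resource_min_ii(int total_instr, int max_per_unit,
                    int non_m_a, int non_m_b)
{
    int ii = ceil_div(total_instr, 8);     /* eight instructions per cycle */
    ii = max2(ii, max_per_unit);           /* one use of a unit per cycle  */
    ii = max2(ii, ceil_div(non_m_a, 3));   /* .S, .L, .D on the A side     */
    ii = max2(ii, ceil_div(non_m_b, 3));   /* .S, .L, .D on the B side     */
    return ii;
}
```

For the loop in this section, resource_min_ii(9, 2, 5, 4) evaluates to 2.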


Table 4-12. Resource Table for If-Then-Else Code


(a) A side

Unit(s)              Instructions    Total/Unit
.M1                                  0
.S1                  SHL and B       2
.D1                  LDH             1
.L1, .S1, or .D1     ADD and SUB     2
Total non-.M units                   5

(b) B side

Unit(s)              Instructions    Total/Unit
.M2                                  0
.S2                  MVK             1
.L2                  CMPEQ           1
.L2 or .S2           AND             1
.L2, .S2, or .D2     ADD             1
Total non-.M units                   4




4.6.6 Allocating Resources 


Now that the graph is split and you know the minimum iteration interval, you
can allocate functional units and registers to the instructions. You must ensure
that no resource is used more than twice.


Example 4—28 shows the ’C62xx instructions with the functional units. 


Example 4-28. List of Actual If-Then-Else Instructions 


        AND    .S2X   B4,A6,B2    ; cond = codeword & mask
[B2]    MVK    .S2    1,B2        ; !(!(cond))
        CMPEQ  .L2    B6,B2,B1    ; (theta == !(!(cond)))
        LDH    .D1    *A4++,A5    ; a[i]
[B1]    ADD    .L1    A7,A5,A7    ; sum += a[i]
[!B1]   SUB    .D1    A7,A5,A7    ; sum -= a[i]
        SHL    .S1    A6,1,A6     ; mask = mask << 1;
[B0]    ADD    .L2    -1,B0,B0    ; decrement counter
[B0]    B      .S1    LOOP        ; for LOOP
4.6.7 Final Assembly With Software Pipelining


Example 4-29 shows the final assembly code after software pipelining. The
performance of this loop is 70 cycles (2 x 32 + 6).


Example 4-29. If-Then-Else Assembly 


        MVK    .S2    32,B0       ; set up loop counter

   [B0] ADD    .L2    -1,B0,B0    ; decrement counter

   [B0] ADD    .L2    -1,B0,B0    ; decrement counter
|| [B0] B      .S1    LOOP        ; for LOOP
||      LDH    .D1    *A4++,A5    ; a[i]

        SHL    .S1    A6,1,A6     ; mask = mask << 1;
||      AND    .S2X   B4,A6,B2    ; cond = codeword & mask

   [B2] MVK    .S2    1,B2        ; !(!(cond))
|| [B0] ADD    .L2    -1,B0,B0    ; decrement counter
|| [B0] B      .S1    LOOP        ;* for LOOP
||      LDH    .D1    *A4++,A5    ;* a[i]

        CMPEQ  .L2    B6,B2,B1    ; (theta == !(!(cond)))
||      SHL    .S1    A6,1,A6     ;* mask = mask << 1;
||      AND    .S2X   B4,A6,B2    ;* cond = codeword & mask
||      ZERO   .L1    A7          ; zero out accumulator

LOOP:
   [B0] ADD    .L2    -1,B0,B0    ; decrement counter
|| [B2] MVK    .S2    1,B2        ;* !(!(cond))
|| [B0] B      .S1    LOOP        ;** for LOOP
||      LDH    .D1    *A4++,A5    ;** a[i]

   [B1] ADD    .L1    A7,A5,A7    ; sum += a[i]
||[!B1] SUB    .D1    A7,A5,A7    ; sum -= a[i]
||      CMPEQ  .L2    B6,B2,B1    ;* (theta == !(!(cond)))
||      SHL    .S1    A6,1,A6     ;** mask = mask << 1;
||      AND    .S2X   B4,A6,B2    ;** cond = codeword & mask
        ; Branch occurs here
4.6.8 Performance Improvements


You can improve the performance of the code in Example 4-29 if you know
that the loop count is at least 3:


- Remove the decrement counter instructions outside the loop.
- Put the MVK (for setting up the loop counter) in parallel with the first
  branch.
- The first two branches are now unconditional, since the loop count is at
  least 3 and you know that the first two branches must execute.


These two changes save two cycles at the beginning of the loop prologue. To
account for the removal of the three decrement-loop-counter instructions, set
the loop counter to 3 less than the actual number of times you want the loop
to execute: in this case, 29 (32 - 3).


Example 4-30 shows the improved loop with a cycle count of 68 (2 x 32 + 4).
Table 4-13 compares the performance of Example 4-29 and Example 4-30.


Table 4-13. Comparison of If-Then-Else Code Examples


Code Example                                                          Cycles         Cycle Count
Example 4-29  If-Then-Else Assembly                                   (2 x 32) + 6   70
Example 4-30  If-Then-Else Assembly With Loop Count Greater Than 3    (2 x 32) + 4   68
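
The cycle counts in Table 4-13 follow one pattern: iteration interval times loop count, plus the cycles spent outside the kernel. The C sketch below restates that arithmetic; the function names are hypothetical and are introduced here only for illustration.

```c
/* Cycle count of a software-pipelined loop: the kernel runs once per
 * iteration at the iteration interval, plus the overhead cycles
 * outside the kernel. */
int pipelined_loop_cycles(int iteration_interval, int loop_count,
                          int overhead_cycles)
{
    return iteration_interval * loop_count + overhead_cycles;
}

/* Initial loop counter value after a number of decrement instructions
 * have been removed from the prologue. */
int adjusted_loop_counter(int loop_count, int removed_decrements)
{
    return loop_count - removed_decrements;
}
```

With an iteration interval of 2 and a loop count of 32, an overhead of 6 cycles gives 70 and an overhead of 4 cycles gives 68; adjusted_loop_counter(32, 3) gives the value 29 loaded by the MVK.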
Example 4-30. If-Then-Else Assembly With Loop Count Greater Than 3


        B      .S1    LOOP        ; for LOOP
||      LDH    .D1    *A4++,A5    ; a[i]
||      MVK    .S2    29,B0       ; set up loop counter

        SHL    .S1    A6,1,A6     ; mask = mask << 1;
||      AND    .S2X   B4,A6,B2    ; cond = codeword & mask

   [B2] MVK    .S2    1,B2        ; !(!(cond))
||      B      .S1    LOOP        ;* for LOOP
||      LDH    .D1    *A4++,A5    ;* a[i]

        CMPEQ  .L2    B6,B2,B1    ; (theta == !(!(cond)))
||      SHL    .S1    A6,1,A6     ;* mask = mask << 1;
||      AND    .S2X   B4,A6,B2    ;* cond = codeword & mask
||      ZERO   .L1    A7          ; zero out accumulator

LOOP:
   [B0] ADD    .L2    -1,B0,B0    ; decrement counter
|| [B2] MVK    .S2    1,B2        ;* !(!(cond))
|| [B0] B      .S1    LOOP        ;** for LOOP
||      LDH    .D1    *A4++,A5    ;** a[i]

   [B1] ADD    .L1    A7,A5,A7    ; sum += a[i]
||[!B1] SUB    .D1    A7,A5,A7    ; sum -= a[i]
||      CMPEQ  .L2    B6,B2,B1    ;* (theta == !(!(cond)))
||      SHL    .S1    A6,1,A6     ;** mask = mask << 1;
||      AND    .S2X   B4,A6,B2    ;** cond = codeword & mask
        ; Branch occurs here
4.7 Loop Unrolling


When resources are not fully utilized, you can help improve performance by
unrolling the loop. In Example 4-30, only nine instructions execute every two
cycles. If you unroll the loop and analyze the new minimum iteration interval,
you have room to add instructions. A minimum iteration interval of 3 provides
a 25-percent improvement in throughput: three cycles to do two iterations,
rather than the four cycles required in Example 4-30.
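
The 25-percent figure can be checked with a short calculation. The C sketch below is illustrative only; both function names are hypothetical. It expresses throughput as cycles per original loop iteration, so a lower number is better.

```c
/* Cycles spent per original (not unrolled) loop iteration. */
double cycles_per_iteration(int iteration_interval, int unroll_factor)
{
    return (double)iteration_interval / (double)unroll_factor;
}

/* Fractional reduction in cycles when going from "before" to "after". */
double cycle_reduction(double before, double after)
{
    return (before - after) / before;
}
```

An iteration interval of 2 with no unrolling costs 2.0 cycles per iteration; an interval of 3 with the loop unrolled twice costs 1.5, a reduction of 0.25.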


4.7.1 Unrolled If-Then-Else C Code


Example 4—31 shows the unrolled version of Example 4—30. 


Example 4—31. Unrolled If-Then-Else C Code 


int unrolled_if_then(short a[], int codeword, int mask, short theta)
{
    int i,sum, cond;

    sum = 0;
    for (i = 0; i < 32; i+=2){
        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask = mask << 1;

        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i+1];
        else
            sum -= a[i+1];
        mask = mask << 1;
    }
    return (sum);
}
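
Unrolling must not change the result. A quick way to gain confidence in a hand-unrolled loop is to run it against the original on the same data. The sketch below does that for the if-then-else loop; the function names are hypothetical, and the codeword and mask are declared unsigned here (an assumption, so that the repeated left shift is well defined in standard C).

```c
#include <stdint.h>

/* One element per iteration. */
int if_then_rolled(const short a[], uint32_t codeword, uint32_t mask,
                   short theta)
{
    int i, sum = 0;
    for (i = 0; i < 32; i++) {
        uint32_t cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask <<= 1;
    }
    return sum;
}

/* Two elements per iteration; the loop count must be even. */
int if_then_unrolled(const short a[], uint32_t codeword, uint32_t mask,
                     short theta)
{
    int i, sum = 0;
    for (i = 0; i < 32; i += 2) {
        uint32_t cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask <<= 1;

        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i + 1];
        else
            sum -= a[i + 1];
        mask <<= 1;
    }
    return sum;
}

/* Returns 1 when both versions agree for the given inputs. */
int unrolled_matches_rolled(const short a[], uint32_t codeword, short theta)
{
    return if_then_rolled(a, codeword, 1u, theta) ==
           if_then_unrolled(a, codeword, 1u, theta);
}
```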
4.7.2 ’C62xx Instructions (Inner Loop)


Example 4-32 shows the unrolled loop with 16 instructions and the possibility
of achieving a loop with a minimum iteration interval of 3.


Example 4-32. List of Symbolic Unrolled If-Then-Else Instructions


          AND    codeword,maski,condi      ; condi = codeword & maski
[condi]   MVK    1,condi                   ; !(!(condi))
          CMPEQ  theta,condi,ifi           ; (theta == !(!(condi)))
          LDH    *aptr++,ai                ; a[i]
[ifi]     ADD    sumi,ai,sumi              ; sum += a[i]
[!ifi]    SUB    sumi,ai,sumi              ; sum -= a[i]
          SHL    maski,1,maski+1           ; maski+1 = maski << 1;
          AND    codeword,maski+1,condi+1  ; condi+1 = codeword & maski+1
[condi+1] MVK    1,condi+1                 ; !(!(condi+1))
          CMPEQ  theta,condi+1,ifi+1       ; (theta == !(!(condi+1)))
          LDH    *aptr++,ai+1              ; a[i+1]
[ifi+1]   ADD    sumi+1,ai+1,sumi+1        ; sum += a[i+1]
[!ifi+1]  SUB    sumi+1,ai+1,sumi+1        ; sum -= a[i+1]
          SHL    maski+1,1,maski           ; maski = maski+1 << 1;
[cntr]    ADD    -1,cntr,cntr              ; decrement counter
[cntr]    B      LOOP                      ; for LOOP
4.7.3 Dependency Graph


Although there are numerous ways to split the dependency graph, the main
goal is to achieve a minimum iteration interval of 3 and meet these conditions:


- You cannot have more than nine non-.M instructions on either side.
- Only three non-.M instructions can execute per cycle.


Figure 4-12 shows the dependency graph for the unrolled, if-then-else code.
Nine instructions are on the A side, and seven instructions are on the B side.


Figure 4-12. Dependency Graph of Unrolled If-Then-Else Code

[Figure not reproduced: the graph is split into an A side and a B side.]
4.7.4 Minimum Iteration Interval


With 16 instructions, the minimum iteration interval is at least 3 because a
maximum of six non-.M instructions can be in parallel with the following
allocation possibilities:


- LDH must be on a .D unit.
- SHL, B, and MVK must be on a .S unit.
- The ADDs and SUB can be on a .S, .L, or .D unit.
- The AND can be on a .S or .L unit.


From Table 4-14, you can see that no one resource is used more than three
times so that the minimum iteration interval is still 3.


Checking the total number of non-.M instructions on each side shows that a
total of nine instructions can be performed with the minimum iteration interval
of 3. Since only seven non-.M instructions are on the B side, the minimum
iteration interval is still 3.


Table 4-14. Resource Table for Unrolled If-Then-Else Code


(a) A side

Unit(s)              Instructions       Total/Unit
.M1                                     0
.S1                  MVK and 2 SHLs     3
.D1                  2 LDHs             2
.L1                  CMPEQ              1
.L1 or .S1           AND                1
.L1, .S1, or .D1     ADD and SUB        2
Total non-.M units                      9

(b) B side

Unit(s)              Instructions       Total/Unit
.M2                                     0
.S2                  MVK and B          2
.L2                  CMPEQ              1
.L2 or .S2           AND                1
.L2, .S2, or .D2     SUB and 2 ADDs     3
Total non-.M units                      7
4.7.5 Allocating Resources


Now that the graph is split and you know the minimum iteration interval, you
can allocate functional units and registers to the instructions. You must ensure
no resource is used more than three times.


Example 4-33 shows the ’C62xx instructions with the functional units.


Example 4-33. Unrolled If-Then-Else Instructions


        AND    .L1X   B4,A6,A2    ; condi = codeword & maski
[A2]    MVK    .S1    1,A2        ; !(!(condi))
        CMPEQ  .L1X   B6,A2,A1    ; (theta == !(!(condi)))
        LDH    .D1    *A4++,A5    ; a[i]
[A1]    ADD    .L1    A7,A5,A7    ; sum += a[i]
[!A1]   SUB    .D1    A7,A5,A7    ; sum -= a[i]
        SHL    .S1    A6,1,A6     ; maski+1 = maski << 1;
        AND    .L2X   B4,A6,B2    ; condi+1 = codeword & maski+1
[B2]    MVK    .S2    1,B2        ; !(!(condi+1))
        CMPEQ  .L2    B6,B2,B1    ; (theta == !(!(condi+1)))
        LDH    .D1    *A4++,B5    ; a[i+1]
[B1]    ADD    .L2    B7,B5,B7    ; sum += a[i+1]
[!B1]   SUB    .D2    B7,B5,B7    ; sum -= a[i+1]
        SHL    .S1    A6,1,A6     ; maski = maski+1 << 1;
[B0]    ADD    .D2    -1,B0,B0    ; decrement counter
[B0]    B      .S2    LOOP        ; for LOOP


4.7.6 Final Assembly 


Example 4-34 shows the final assembly code after software pipelining. The
cycle count of this loop is now 53: (3 x 16) + 5.


Code Example                                                          Cycles         Cycle Count
Example 4-29  If-Then-Else Assembly                                   (2 x 32) + 6   70
Example 4-30  If-Then-Else Assembly With Loop Count Greater Than 3    (2 x 32) + 4   68
Example 4-33  Unrolled If-Then-Else Instructions                      (3 x 16) + 5   53
Example 4-34. Unrolled If-Then-Else Assembly


        MVK    .S2    16,B0       ; set up loop counter

        LDH    .D1    *A4++,A5    ; a[i]
|| [B0] ADD    .D2    -1,B0,B0    ; decrement counter

        LDH    .D1    *A4++,B5    ; a[i+1]
|| [B0] B      .S2    LOOP        ; for LOOP
|| [B0] ADD    .D2    -1,B0,B0    ; decrement counter
||      SHL    .S1    A6,1,A6     ; maski+1 = maski << 1;
||      AND    .L1X   B4,A6,A2    ; condi = codeword & maski

   [A2] MVK    .S1    1,A2        ; !(!(condi))
||      AND    .L2X   B4,A6,B2    ; condi+1 = codeword & maski+1
||      ZERO   .L1    A7          ; zero accumulator

   [B2] MVK    .S2    1,B2        ; !(!(condi+1))
||      CMPEQ  .L1X   B6,A2,A1    ; (theta == !(!(condi)))
||      SHL    .S1    A6,1,A6     ; maski = maski+1 << 1;
||      LDH    .D1    *A4++,A5    ;* a[i]
||      ZERO   .L2    B7          ; zero accumulator

LOOP:
        CMPEQ  .L2    B6,B2,B1    ; (theta == !(!(condi+1)))
|| [B0] ADD    .D2    -1,B0,B0    ; decrement counter
||      LDH    .D1    *A4++,B5    ;* a[i+1]
|| [B0] B      .S2    LOOP        ;* for LOOP
||      SHL    .S1    A6,1,A6     ;* maski+1 = maski << 1;
||      AND    .L1X   B4,A6,A2    ;* condi = codeword & maski

   [A1] ADD    .L1    A7,A5,A7    ; sum += a[i]
||[!A1] SUB    .D1    A7,A5,A7    ; sum -= a[i]
|| [A2] MVK    .S1    1,A2        ;* !(!(condi))
||      AND    .L2X   B4,A6,B2    ;* condi+1 = codeword & maski+1

   [B1] ADD    .L2    B7,B5,B7    ; sum += a[i+1]
||[!B1] SUB    .D2    B7,B5,B7    ; sum -= a[i+1]
|| [B2] MVK    .S2    1,B2        ;* !(!(condi+1))
||      CMPEQ  .L1X   B6,A2,A1    ;* (theta == !(!(condi)))
||      SHL    .S1    A6,1,A6     ;* maski = maski+1 << 1;
||      LDH    .D1    *A4++,A5    ;** a[i]
        ; Branch occurs here

        ADD    .L1X   A7,B7,A4    ; move to return register
4.8 Live-Too-Long Issues


When the result of a parent instruction is live longer than the minimum iteration 
interval of a loop, you have a live-too-long problem. Because each instruction 
executes every iteration interval cycle, the next iteration of that parent over- 
writes the register with a new value before the child can read it. Section 4.4.10, 
Resource Conflicts on page 4-35, solved this problem simply by moving the 
parent to a later cycle. 


4.8.1 C Code With Live-Too-Long Problem 


Example 4-35 shows C code with a live-too-long problem that cannot be 
solved by rescheduling the parent instruction. A split-join path in the depen- 
dency graph in Figure 4—13 causes this live-too-long problem. 


Example 4—35. Live-Too-Long C Code 


int live_long(short a[],short b[],short c, short d, short e)
{
    int i,sum0,sum1,sum,a0,a2,a3,b0,b2,b3;
    short a1,b1;

    sum0 = 0;
    sum1 = 0;
    for (i=0; i<100; i++) {
        a0 = a[i] * c;
        a1 = a0 >> 15;
        a2 = a1 * d;
        a3 = a2 + a0;
        sum0 += a3;
        b0 = b[i] * c;
        b1 = b0 >> 15;
        b2 = b1 * e;
        b3 = b2 + b0;
        sum1 += b3;
    }
    sum = sum0 + sum1;
    return (sum);
}


4.8.2 ’C62xx Instructions (Inner Loop)


Example 4-36 shows the symbolic instructions that execute the loop in
Example 4-35.


Example 4-36. List of Symbolic Live-Too-Long Instructions


        LDH    *aptr++,ai         ; load ai from memory
        LDH    *bptr++,bi         ; load bi from memory
        MPY    ai,c,a0            ; a0 = ai * c
        SHR    a0,15,a1           ; a1 = a0 >> 15
        MPY    a1,d,a2            ; a2 = a1 * d
        ADD    a2,a0,a3           ; a3 = a2 + a0
        ADD    sum0,a3,sum0       ; sum0 += a3
        MPY    bi,c,b0            ; b0 = bi * c
        SHR    b0,15,b1           ; b1 = b0 >> 15
        MPY    b1,e,b2            ; b2 = b1 * e
        ADD    b2,b0,b3           ; b3 = b2 + b0
        ADD    sum1,b3,sum1       ; sum1 += b3
[cntr]  SUB    cntr,1,cntr        ; decrement loop counter
[cntr]  B      LOOP               ; branch to loop
4.8.3 Dependency Graph


Figure 4-13 shows the dependency graph for the live-too-long code. This
algorithm includes three separate and independent graphs. Two of the
independent graphs have split-join paths from a0 to a3 and from b0 to b3.


Figure 4-13. Live-Too-Long Code

[Figure not reproduced: three independent graphs, two of which contain a
split-join path, and a third that contains the SUB.]
4.8.4 Minimum Iteration Interval


Table 4-15 shows the functional unit resources for the loop. Based on the
resource usage, the minimum iteration interval is 2 for the following reasons:


- No specific resources are used more than twice, implying a minimum
  iteration interval of 2.
- A total of five non-.M units on each side also implies a minimum iteration
  interval of 2, since three non-.M units can be used on a side during each
  cycle.


Table 4-15. Resource Table for Live-Too-Long Code


(a) A side

Unit(s)              Instructions       Total/Unit
.M1                  2 MPYs             2
.S1                  B and SHR          2
.D1                  LDH                1
.L1, .S1, or .D1     2 ADDs             2
Total non-.M units                      5

(b) B side

Unit(s)              Instructions       Total/Unit
.M2                  2 MPYs             2
.S2                  SHR                1
.D2                  LDH                1
.L2, .S2, or .D2     2 ADDs and SUB     3
Total non-.M units                      5
4.8.5 Split-Join-Path Problems


The minimum iteration interval is determined by both resources and data
dependency. A loop carry path determined the minimum iteration interval of the
IIR filter in Section 4.5, Loop Carry Paths, on page 4-46. In this example, a
live-too-long problem determines the minimum iteration interval.


In Figure 4-13, the two split-join paths from a0 to a3 and from b0 to b3 create
the live-too-long problem. Because the ADD a3 instruction cannot be scheduled
until the SHR a1 and MPY a2 instructions finish, a0 must be live for at least
four cycles. For example:


- If MPY a0 is scheduled on cycle 5, then the earliest SHR a1 can be
  scheduled is cycle 7.
- The earliest MPY a2 can be scheduled is cycle 8.
- The earliest ADD a3 can be scheduled is cycle 10.


Because a0 is written at the end of cycle 6, it must be live from cycle 7 to
cycle 10, or four cycles. Therefore, if the value must be live for four cycles, the
minimum iteration interval must be at least 4. A minimum iteration interval of
4 means that the loop executes at half the performance that it could, based on
resources.


One way to solve this problem is to unroll the loop, so that you are doing twice 
as much work in each iteration. After unrolling, the minimum iteration interval 
is 4, based on both the resources and the data dependencies of the split-join 
path. Although unrolling the loop allows you to achieve the highest possible 
loop throughput, unrolling the loop does increase the code size. 


4.8.6 Inserting Moves 


Another solution to the live-too-long problem is to break up the lifetime of a0
and b0 by inserting move instructions (MVs). The MV instruction breaks up the
left path of the split-join path into two smaller pieces.


4.8.7 New Dependency Graph

Figure 4-14 shows the new dependency graph with the MV instructions. Now
the left paths of the split-join paths are broken into two pieces. Each value, a0
and a0', can be live for minimum iteration interval cycles. If MPY a0 is
scheduled on cycle 5 and ADD a3 is scheduled on cycle 10, you can achieve a
minimum iteration interval of 2 as follows:


- If MV a0' is scheduled on cycle 8, then a0 is live on cycles 7 and 8, and
  a0' is live on cycles 9 and 10.
- Since no values are live more than two cycles, the minimum iteration
  interval for this graph is 2.
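
The lifetime arithmetic used in this example can be written out as a short C sketch. It is an illustration only: the function names are hypothetical, and it assumes that the copy made by the MV becomes readable on the cycle after the MV executes, which is what the cycle numbers above imply.

```c
/* Number of cycles a value is live, counting the first cycle on which
 * it can be read through the cycle of its last read, inclusive. */
int live_cycles(int first_readable_cycle, int last_read_cycle)
{
    return last_read_cycle - first_readable_cycle + 1;
}

/* Longest lifetime after a move on mv_cycle splits the value in two:
 * the original is last read by the move itself, and the copy is
 * readable from the following cycle until the original last read. */
int longest_lifetime_with_move(int first_readable_cycle,
                               int last_read_cycle, int mv_cycle)
{
    int original = live_cycles(first_readable_cycle, mv_cycle);
    int copy = live_cycles(mv_cycle + 1, last_read_cycle);
    return original > copy ? original : copy;
}
```

Without the move, live_cycles(7, 10) is 4, so the iteration interval cannot be less than 4. With a move on cycle 8, longest_lifetime_with_move(7, 10, 8) is 2, which allows an iteration interval of 2.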


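Figure 4-14. New Dependency Graph of Live-Too-Long Code

[Figure not reproduced: the dependency graph of Figure 4-13 with an MV
instruction inserted in the left path of each split-join path.]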
4.8.8 Allocating Resources


Example 4-37 shows the ’C62xx instructions with the functional units
assigned. The choice of units for the ADDs and SUB is flexible and represents
one of a number of possibilities. One goal is to ensure that no functional unit
is used more than the minimum iteration interval, or two times.


The two 2X paths and one 1X path are required because the values c, d, and
e reside on the side opposite from the instruction that is reading them. If these
values had created a bottleneck of resources and caused the minimum
iteration interval to increase, c, d, and e could have been loaded into the
opposite register file outside the loop to eliminate the cross path.


Example 4-37. List of Actual Live-Too-Long Code Instructions 


        LDH    .D1    *A4++,A0    ; load ai from memory
        LDH    .D2    *B4++,B0    ; load bi from memory
        MPY    .M1    A0,A6,A3    ; a0 = ai * c
        SHR    .S1    A3,15,A5    ; a1 = a0 >> 15
        MPY    .M1X   A5,B6,A7    ; a2 = a1 * d
        MV     .D1    A3,A2       ; save a0 across iterations
        ADD    .L1    A7,A2,A9    ; a3 = a2 + a0
        ADD    .L1    A1,A9,A1    ; sum0 += a3
        MPY    .M2X   B0,A6,B10   ; b0 = bi * c
        SHR    .S2    B10,15,B5   ; b1 = b0 >> 15
        MPY    .M2X   B5,A8,B7    ; b2 = b1 * e
        MV     .D2    B10,B8      ; save b0 across iterations
        ADD    .L2    B7,B8,B9    ; b3 = b2 + b0
        ADD    .L2    B1,B9,B1    ; sum1 += b3
[B2]    SUB    .S2    B2,1,B2     ; decrement loop counter
[B2]    B      .S1    LOOP        ; branch to loop
4.8.9 Final Assembly With Move Instructions


Example 4-38 shows the final assembly code after software pipelining. The
performance of this loop is 212 cycles (2 x 100 + 11 + 1).


Example 4-38. Final Assembly With Move Instructions


        LDH    .D1    *A4++,A0    ; load ai from memory
||      LDH    .D2    *B4++,B0    ; load bi from memory

        MVK    .S2    100,B2      ; set up loop counter

        LDH    .D1    *A4++,A0    ;* load ai from memory
||      LDH    .D2    *B4++,B0    ;* load bi from memory

        ZERO   .L1    A1          ; zero out accumulator
||      ZERO   .L2    B1          ; zero out accumulator

        LDH    .D1    *A4++,A0    ;** load ai from memory
||      LDH    .D2    *B4++,B0    ;** load bi from memory

   [B2] SUB    .S2    B2,1,B2     ; decrement loop counter

        MPY    .M1    A0,A6,A3    ; a0 = ai * c
||      MPY    .M2X   B0,A6,B10   ; b0 = bi * c
||      LDH    .D1    *A4++,A0    ;*** load ai from memory
||      LDH    .D2    *B4++,B0    ;*** load bi from memory

   [B2] SUB    .S2    B2,1,B2     ; decrement loop counter
|| [B2] B      .S1    LOOP        ; branch to loop

        SHR    .S1    A3,15,A5    ; a1 = a0 >> 15
||      SHR    .S2    B10,15,B5   ; b1 = b0 >> 15
||      MPY    .M1    A0,A6,A3    ;* a0 = ai * c
||      MPY    .M2X   B0,A6,B10   ;* b0 = bi * c
||      LDH    .D1    *A4++,A0    ;**** load ai from memory
||      LDH    .D2    *B4++,B0    ;**** load bi from memory

        MPY    .M1X   A5,B6,A7    ; a2 = a1 * d
||      MV     .D1    A3,A2       ; save a0 across iterations
||      MPY    .M2X   B5,A8,B7    ; b2 = b1 * e
||      MV     .D2    B10,B8      ; save b0 across iterations
|| [B2] SUB    .S2    B2,1,B2     ;* decrement loop counter
|| [B2] B      .S1    LOOP        ;* branch to loop

        SHR    .S1    A3,15,A5    ;* a1 = a0 >> 15
||      SHR    .S2    B10,15,B5   ;* b1 = b0 >> 15
||      MPY    .M1    A0,A6,A3    ;** a0 = ai * c
||      MPY    .M2X   B0,A6,B10   ;** b0 = bi * c
||      LDH    .D1    *A4++,A0    ;***** load ai from memory
||      LDH    .D2    *B4++,B0    ;***** load bi from memory

LOOP:
        ADD    .L1    A7,A2,A9    ;* a3 = a2 + a0
||      ADD    .L2    B7,B8,B9    ;* b3 = b2 + b0
||      MPY    .M1X   A5,B6,A7    ;* a2 = a1 * d
||      MV     .D1    A3,A2       ;* save a0 across iterations
||      MPY    .M2X   B5,A8,B7    ;* b2 = b1 * e
||      MV     .D2    B10,B8      ;* save b0 across iterations
|| [B2] SUB    .S2    B2,1,B2     ;** decrement loop counter
|| [B2] B      .S1    LOOP        ;** branch to loop

        ADD    .L1    A1,A9,A1    ; sum0 += a3
||      ADD    .L2    B1,B9,B1    ; sum1 += b3
||      SHR    .S1    A3,15,A5    ;** a1 = a0 >> 15
||      SHR    .S2    B10,15,B5   ;** b1 = b0 >> 15
||      MPY    .M1    A0,A6,A3    ;*** a0 = ai * c
||      MPY    .M2X   B0,A6,B10   ;*** b0 = bi * c
||      LDH    .D1    *A4++,A0    ;****** load ai from memory
||      LDH    .D2    *B4++,B0    ;****** load bi from memory
        ; Branch occurs here

        ADD    .L1X   A1,B1,A4    ; sum = sum0 + sum1
4.9 Redundant Load Elimination


Filter algorithms typically read the same value from memory multiple times and
are, therefore, prime candidates for optimization by eliminating redundant load
instructions. Rather than perform a load operation each time a particular value
is read, you can keep the value in a register and read the register multiple
times.


4.9.1 FIR C Code


Example 4-39 shows C code for a simple FIR filter. There are two memory
reads (x[i+j] and h[i]) for each multiply. Because the ’C62xx can perform only
two LDHs per cycle, it seems, at first glance, that only one multiply-accumulate
per cycle is possible.


One way to optimize this situation is to perform LDWs instead of LDHs to read
two data values at a time. Although using LDW works for the h array, the x array
presents a different problem because the ’C62xx does not allow you to load
values across a word boundary. For example:


- On the first outer loop (j = 0), you can read the x-array elements (0 and 1,
  2 and 3, etc.) as long as elements 0 and 1 are aligned on a 4-byte word
  boundary.
- However, the second outer loop (j = 1) requires reading x-array elements
  1 through 32. The LDW operation would have to load elements that are
  not word-aligned: 1 and 2, 3 and 4, etc.
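
The alignment condition can be stated as a one-line check. The C sketch below is illustrative only; the function name is hypothetical. It takes the byte address of element 0 of an array of 16-bit values and the index of the first element of the pair that a word load would read.

```c
#include <stddef.h>

/* Returns 1 if a 4-byte word load that starts at halfword element
 * first_index falls on a word boundary, 0 otherwise. Each element is
 * 2 bytes wide, so element k starts at byte base_byte_addr + 2 * k. */
int word_load_is_aligned(size_t base_byte_addr, size_t first_index)
{
    return ((base_byte_addr + 2 * first_index) % 4) == 0;
}
```

With element 0 on a word boundary, pairs that start at an even index (0 and 1, 2 and 3) are aligned, and pairs that start at an odd index (1 and 2, 3 and 4) are not.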


Example 4-39. FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum;

    for (j = 0; j < 100; j++) {
        sum = 0;
        for (i = 0; i < 32; i++)
            sum += x[i + j] * h[i];
        y[j] = sum >> 15;
    }
}


4.9.2 Redundant Loads 


In order to achieve two multiply-accumulates per cycle, you need to reduce the 
number of LDHs. Because successive outer loops read all the same h-array 
values and almost all of the same x-array values, you can eliminate the redun- 
dant loads by unrolling the inner and outer loops. 


For example, x[1] is needed for the first outer loop (x[j+1] with j = 0) and for
the second outer loop (x[j] with j = 1). You can use a single LDH instruction to
load this value.


4.9.3 New FIR C Code 


Example 4—40 shows that after eliminating redundant loads, there are four 
memory-read operations for every four multiply-accumulate operations. Now 
the memory accesses no longer limit the performance. 


Example 4—40. FIR Filter C Code With Redundant Load Elimination 
void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,h0,h1;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=2){
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
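
Because the rewritten filter reorders the loads and products, it is worth confirming that it produces the same output as the simple version. The following C sketch is one way to do that. The names fir_ref, fir_rle, and fir_versions_agree are hypothetical, as are the test data; the input array is sized at 132 elements because both versions read up to x[130].

```c
#define N_OUT  100
#define N_TAPS 32
#define N_IN   (N_OUT + N_TAPS)

/* Simple form: two loads per multiply-accumulate. */
void fir_ref(const short x[], const short h[], short y[])
{
    int i, j, sum;
    for (j = 0; j < N_OUT; j++) {
        sum = 0;
        for (i = 0; i < N_TAPS; i++)
            sum += x[i + j] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* Redundant loads removed: four loads per four multiply-accumulates. */
void fir_rle(const short x[], const short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, h0, h1;
    for (j = 0; j < N_OUT; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < N_TAPS; i += 2) {
            x1 = x[j + i + 1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j + i + 2];
            h1 = h[i + 1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }
        y[j] = (short)(sum0 >> 15);
        y[j + 1] = (short)(sum1 >> 15);
    }
}

/* Returns 1 if both versions produce identical output on test data.
 * Values stay within +/-4000 so that the 32-term sums fit in an int. */
int fir_versions_agree(void)
{
    short x[N_IN], h[N_TAPS], y0[N_OUT], y1[N_OUT];
    int k;
    for (k = 0; k < N_IN; k++)
        x[k] = (short)((k * 977) % 8001 - 4000);
    for (k = 0; k < N_TAPS; k++)
        h[k] = (short)((k * 613) % 8001 - 4000);
    fir_ref(x, h, y0);
    fir_rle(x, h, y1);
    for (k = 0; k < N_OUT; k++)
        if (y0[k] != y1[k])
            return 0;
    return 1;
}
```

Right-shifting a negative int is implementation-defined in C, but both versions apply the same shift to the same sums, so the comparison holds either way.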


4.9.4 Symbolic ’C62xx Instructions (Inner Loop)


Example 4-41 shows the ’C62xx instructions that perform the innermost loop.


Element x0 is read by the MPY p00 before it is loaded by the LDH x0
instruction; x[j] (the first x0) is loaded outside the loop, but successive even
elements are loaded inside the loop.


Example 4-41. List Of Symbolic FIR Instructions 


        LDH    *x_1++[2],x1       ; x1 = x[j+i+1]
        LDH    *h++[2],h0         ; h0 = h[i]
        MPY    x0,h0,p00          ; x0 * h0
        MPY    x1,h0,p10          ; x1 * h0
        ADD    p00,sum0,sum0      ; sum0 += x0 * h0
        ADD    p10,sum1,sum1      ; sum1 += x1 * h0
        LDH    *x++[2],x0         ; x0 = x[j+i+2]
        LDH    *h_1++[2],h1       ; h1 = h[i+1]
        MPY    x1,h1,p01          ; x1 * h1
        MPY    x0,h1,p11          ; x0 * h1
        ADD    p01,sum0,sum0      ; sum0 += x1 * h1
        ADD    p11,sum1,sum1      ; sum1 += x0 * h1
[cntr]  SUB    cntr,1,cntr        ; decrement loop counter
[cntr]  B      LOOP               ; branch to loop
4.9.5 Dependency Graph

Figure 4-15 shows the dependency graph of the FIR filter.


Figure 4-15. Dependency Graph of FIR Filter With Redundant Load Elimination

[Figure not reproduced: the graph is split into an A side and a B side and
contains the LDH, MPY, ADD, and SUB nodes of the inner loop.]
4.9.6 Minimum Iteration Interval


Table 4-16 shows that the minimum iteration interval is 2. An iteration interval
of 2 means that two multiply-accumulates are executing per cycle.


Table 4-16. Resource Table for FIR Code


(a) A side

Unit(s)              Instructions       Total/Unit
.M1                  2 MPYs             2
.S1                                     0
.D1                  2 LDHs             2
.L1, .S1, or .D1     2 ADDs             2
Total non-.M units                      4
1X paths                                2

(b) B side

Unit(s)              Instructions       Total/Unit
.M2                  2 MPYs             2
.S2                  B                  1
.D2                  2 LDHs             2
.L2, .S2, or .D2     2 ADDs and SUB     3
Total non-.M units                      6
2X paths                                2

4.9.7 ’C62xx Instructions (Inner Loop) 


Example 4—42 shows the ’C62xx instructions with allocated resources. 


Example 4-42. List of Actual FIR Instructions 


        LDH    .D2    *B5++[2],B1    ; x1 = x[j+i+1]
        LDH    .D1    *A5++[2],A1    ; h0 = h[i]
        MPY    .M1    A0,A1,A7       ; x0 * h0
        MPY    .M1X   B1,A1,A8       ; x1 * h0
        ADD    .L1    A7,A9,A9       ; sum0 += x0 * h0
        ADD    .L2X   A8,B9,B9       ; sum1 += x1 * h0
        LDH    .D1    *A4++[2],A0    ; x0 = x[j+i+2]
        LDH    .D2    *B4++[2],B0    ; h1 = h[i+1]
        MPY    .M2    B1,B0,B7       ; x1 * h1
        MPY    .M2X   A0,B0,B8       ; x0 * h1
        ADD    .L1X   B7,A9,A9       ; sum0 += x1 * h1
        ADD    .L2    B8,B9,B9       ; sum1 += x0 * h1
[B2]    SUB    .S2    B2,1,B2        ; decrement loop counter
[B2]    B      .S2    LOOP           ; branch to loop
4.9.8 Final Assembly

Example 4-43 shows the final assembly for the FIR without redundant load
instructions.


- At the end of the inner loop is a branch to OUTLOOP that executes the next
  outer loop.
- The outer loop counter is 50 because iterations j and j + 1 execute each
  time the inner loop is run.
- The inner loop counter is 16 because iterations i and i + 1 execute each
  inner loop iteration.


The cycle count for this nested loop is 2352 cycles: 50 (16 x 2 + 9 + 6) + 2.
Fifteen cycles are overhead for each outer loop:


- Nine cycles execute the inner loop prologue.
- Six cycles execute the branch to the outer loop.
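
The nested-loop cycle count can be reproduced with a small function. The C sketch below is illustrative only; the function and parameter names are hypothetical, and the two cycles added at the end are taken from the formula above as setup that runs once, outside the outer loop.

```c
/* Cycle count of a nested loop whose inner loop is software pipelined
 * and whose outer loop is not. Every outer iteration pays for the
 * inner loop kernel, the inner loop prologue, and the branch back to
 * the top of the outer loop. */
int nested_loop_cycles(int outer_count, int inner_count,
                       int iteration_interval, int prologue_cycles,
                       int outer_branch_cycles, int setup_cycles)
{
    int per_outer = inner_count * iteration_interval
                  + prologue_cycles + outer_branch_cycles;
    return outer_count * per_outer + setup_cycles;
}
```

For this example, nested_loop_cycles(50, 16, 2, 9, 6, 2) gives 2352. Of the 47 cycles in each outer iteration, 15 are overhead.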


See Section 4.11, Software Pipelining the Outer Loop, for information on how 
to reduce this overhead. 


Example 4-43. FIR With Redundant Load Elimination


        MVK    .S1    50,A2          ; set up outer loop counter

        MVK    .S1    80,A3          ; used to rst x ptr outer loop
||      MVK    .S2    82,B6          ; used to rst h ptr outer loop

OUTLOOP:
        LDH    .D1    *A4++[2],A0    ; x0 = x[j]
||      ADD    .L2X   A4,2,B5        ; set up pointer to x[j+1]
||      ADD    .D2    B4,2,B4        ; set up pointer to h[1]
||      ADD    .L1X   B4,0,A5        ; set up pointer to h[0]
||      MVK    .S2    16,B2          ; set up inner loop counter
|| [A2] SUB    .S1    A2,1,A2        ; decrement outer loop counter

        LDH    .D1    *A5++[2],A1    ; h0 = h[i]
||      LDH    .D2    *B5++[2],B1    ; x1 = x[j+i+1]
||      ZERO   .L1    A9             ; zero out sum0
||      ZERO   .L2    B9             ; zero out sum1

        LDH    .D2    *B4++[2],B0    ; h1 = h[i+1]
||      LDH    .D1    *A4++[2],A0    ; x0 = x[j+i+2]

        LDH    .D1    *A5++[2],A1    ;* h0 = h[i]
||      LDH    .D2    *B5++[2],B1    ;* x1 = x[j+i+1]

   [B2] SUB    .S2    B2,1,B2        ; decrement inner loop counter
||      LDH    .D2    *B4++[2],B0    ;* h1 = h[i+1]
||      LDH    .D1    *A4++[2],A0    ;* x0 = x[j+i+2]

   [B2] B      .S2    LOOP           ; branch to inner loop
||      LDH    .D1    *A5++[2],A1    ;** h0 = h[i]
||      LDH    .D2    *B5++[2],B1    ;** x1 = x[j+i+1]

        MPY    .M1    A0,A1,A7       ; x0 * h0
|| [B2] SUB    .S2    B2,1,B2        ;* decrement inner loop counter
||      LDH    .D2    *B4++[2],B0    ;** h1 = h[i+1]
||      LDH    .D1    *A4++[2],A0    ;** x0 = x[j+i+2]

        MPY    .M2    B1,B0,B7       ; x1 * h1
||      MPY    .M1X   B1,A1,A8       ; x1 * h0
|| [B2] B      .S2    LOOP           ;* branch to inner loop
||      LDH    .D1    *A5++[2],A1    ;*** h0 = h[i]
||      LDH    .D2    *B5++[2],B1    ;*** x1 = x[j+i+1]

        MPY    .M2X   A0,B0,B8       ; x0 * h1
||      MPY    .M1    A0,A1,A7       ;* x0 * h0
|| [B2] SUB    .S2    B2,1,B2        ;** decrement inner loop counter
||      LDH    .D2    *B4++[2],B0    ;*** h1 = h[i+1]
||      LDH    .D1    *A4++[2],A0    ;*** x0 = x[j+i+2]

LOOP:
        ADD    .L2X   A8,B9,B9       ; sum1 += x1 * h0
||      ADD    .L1    A7,A9,A9       ; sum0 += x0 * h0
||      MPY    .M2    B1,B0,B7       ;* x1 * h1
||      MPY    .M1X   B1,A1,A8       ;* x1 * h0
|| [B2] B      .S2    LOOP           ;** branch to inner loop
||      LDH    .D1    *A5++[2],A1    ;**** h0 = h[i]
||      LDH    .D2    *B5++[2],B1    ;**** x1 = x[j+i+1]

        ADD    .L1X   B7,A9,A9       ; sum0 += x1 * h1
||      ADD    .L2    B8,B9,B9       ; sum1 += x0 * h1
||      MPY    .M2X   A0,B0,B8       ;* x0 * h1
||      MPY    .M1    A0,A1,A7       ;** x0 * h0
|| [B2] SUB    .S2    B2,1,B2        ;*** decrement inner loop cntr
||      LDH    .D2    *B4++[2],B0    ;**** h1 = h[i+1]
||      LDH    .D1    *A4++[2],A0    ;**** x0 = x[j+i+2]
        ; inner loop branch occurs here

   [A2] B      .S1    OUTLOOP        ; branch to outer loop
||      SUB    .L1    A4,A3,A4       ; reset x pointer to x[j]
||      SUB    .L2    B4,B6,B4       ; reset h pointer to h[0]

        SHR    .S1    A9,15,A9       ; sum0 >> 15
||      SHR    .S2    B9,15,B9       ; sum1 >> 15

        STH    .D1    A9,*A6++       ; y[j] = sum0 >> 15

        STH    .D1    B9,*A6++       ; y[j+1] = sum1 >> 15

        NOP    2                     ; branch delay slots
        ; outer loop branch occurs here
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4.10 Memory Banks 


The internal memory of the ’'C62xx family varies from device to device. See 
the TMS320C62xx Peripherals Reference Guide to determine the memory 
spaces in your particular device. 


Most ’C62xx devices use an interleaved memory bank scheme, as shown in 
Figure 4—16. Each number in the diagram represents a byte address. A load 
byte (LDB) instruction from address 0 loads byte 0 in Bank 0. A load halfword 
(LDH) from address 0 loads the halfword value in bytes 0 and 1, which are also 
in Bank 0. An LDW from address 0 loads bytes 0 through 3 in Banks 0 and 1. 


Because each bank is single-ported memory, only one access to each bank
is allowed per cycle. Two accesses to a single bank in a given cycle result in
a memory stall that halts all pipeline operation for one cycle while the second
value is read from memory. Two memory operations per cycle are allowed
without any stall, as long as they do not access the same bank.


For devices that have more than one memory space (Figure 4—17), an access 
to Bank 0 in one space does not interfere with an access to Bank 0 in another 
memory space, and no pipeline stall occurs. 
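
The interleaving rule described above can be sketched in C. This is only an illustration under stated assumptions: four banks, each one halfword (2 bytes) wide, which is what the byte-address description implies, and naturally aligned accesses. The names bank_mask and bank_conflict are hypothetical helpers introduced here; they are not part of any TI tool or library.

```c
#include <stdint.h>

enum { BANK_WIDTH_BYTES = 2, NUM_BANKS = 4 };  /* assumed from the text */

/* Bit mask of the banks touched by an access of `size` bytes (1, 2, or 4)
   starting at byte address `addr`. Bit n set means Bank n is used. */
static unsigned bank_mask(uint32_t addr, unsigned size)
{
    unsigned mask = 0;
    uint32_t a;
    for (a = addr; a < addr + size; a++)
        mask |= 1u << ((a / BANK_WIDTH_BYTES) % NUM_BANKS);
    return mask;
}

/* Nonzero when two accesses issued on the same cycle would touch the same
   bank of the same memory space and therefore stall the pipeline. */
static int bank_conflict(uint32_t addr_a, unsigned size_a, unsigned space_a,
                         uint32_t addr_b, unsigned size_b, unsigned space_b)
{
    if (space_a != space_b)
        return 0;   /* different memory spaces never interfere */
    return (bank_mask(addr_a, size_a) & bank_mask(addr_b, size_b)) != 0;
}
```

For example, two LDHs from byte addresses 0 and 8 in the same memory space both land in Bank 0 and conflict, while LDHs from addresses 0 and 2 use Banks 0 and 1 and do not.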


Figure 4-16. Four-Bank Interleaved Memory 
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Figure 4—17. Four-Bank Interleaved Memory With Two Memory Spaces 


[Figure 4-17 diagram: memory space 0 and memory space 1, each interleaved
across Bank 0, Bank 1, Bank 2, and Bank 3.]


4.10.1 FIR Inner Loop 


Example 4-44 shows the inner loop from the final assembly in Example 4-43.


- LDHs from the h array are in parallel with LDHs from the x array.

- If x[1] is on an even halfword (Bank 0) and h[0] is on an odd halfword
  (Bank 1), Example 4-44 has no memory hits. However, if both x[1] and
  h[0] are on an even halfword in memory (Bank 0) and they are in the same
  memory space, every cycle incurs a memory pipeline stall and the loop
  runs at half the speed.


Example 4-44. Inner Loop of FIR 


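LOOP:
        ADD     .L2X    A8,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1     A7,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2     B1,B0,B7        ;** x1 * h1
||      MPY     .M1X    B1,A1,A8        ;** x1 * h0
|| [B2] B       .S2     LOOP            ;** branch to inner loop
||      LDH     .D1     *A5++[2],A1     ;**** h0 = h[i]
||      LDH     .D2     *B5++[2],B1     ;**** x1 = x[j+i+1]

        ADD     .L1X    B7,A9,A9        ; sum0 += x1 * h1
||      ADD     .L2     B8,B9,B9        ; sum1 += x0 * h1
||      MPY     .M2X    A0,B0,B8        ;** x0 * h1
||      MPY     .M1     A0,A1,A7        ;** x0 * h0
|| [B2] SUB     .S2     B2,1,B2         ;*** decrement inner loop cntr
||      LDH     .D2     *B4++[2],B0     ;**** h1 = h[i+1]
||      LDH     .D1     *A4++[2],A0     ;**** x0 = x[j+i+2]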


It is not always possible to fully control how arrays are aligned, especially if one
of the arrays is passed into a function as a pointer and that pointer might have
different alignments each time the function is called. One solution to this
problem is to write an FIR filter that never has memory hits, regardless of the
x and h array alignments.


If accesses to the even and odd elements of an array (h or x) are scheduled
on the same cycle, the accesses are always on adjacent memory banks. Thus,
to write an FIR filter that cannot ever have any memory hits, even and odd
elements of the same array must be scheduled on the same loop cycle.


In the case of the FIR, scheduling the even and odd elements of the same array
on the same loop cycle cannot be done in a two-cycle loop, as shown in
Figure 4-18, which bases a valid loop on the following constraints:


- h0 and h1 are on the same loop cycle.

- x0 and x1 are on the same loop cycle.

- MPY p00 must be scheduled three or four cycles after LDH x0, since it
  must read x0 from the previous iteration of LDH x0.

- All MPYs must be five or six cycles after their LDH parents.

- No MPYs on the same side (A or B) can be on the same loop cycle.


Figure 4-18 shows one scenario that almost works. All nodes satisfy the
above constraints except MPY p10. Because one parent is on cycle 1
(LDH h0) and another on cycle 0 (LDH x1), the only cycle for MPY p10 is
cycle 6. However, another MPY on the A side is also scheduled on cycle 6
(MPY p00). Other combinations of cycles for this graph produce similar
results.


Figure 4—18. FIR With Even and Odd Elements of Each Array on Same Loop Cycle 
[Dependency graph: LDH nodes and their children, arranged in an A side and a
B side column.]


Note: Numbers in bold represent the cycle the instruction is scheduled on. 
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4.10.2 Unrolled FIR C Code 


The main limitation in solving the problem in Figure 4-18 is that you are trying
to schedule a 2-cycle loop, which means that no values can be live more than
two cycles. One solution is to increase the iteration interval to 3, but this
decreases performance. Another way to solve this problem is to unroll the
inner loop one more time and produce a 4-cycle loop.


Example 4-45 shows the FIR C code after unrolling the inner loop one more
time. This solution adds to the flexibility of scheduling and allows you to write
FIR code that never has memory hits, regardless of array alignment and
memory space.


This solution is needed only when both the h and x arrays must reside in the 
same memory space; if each array resides in a separate memory space, the 
2-cycle loop in Example 4—40 is sufficient. 
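
The alignment argument can be checked by brute force. The sketch below assumes the four-bank, halfword-wide interleaving described in Section 4.10; halfword_bank, same_array_conflicts, and cross_array_conflicts are hypothetical names used only for this illustration. It counts how often two halfword loads issued on the same cycle fall into the same bank, first for adjacent elements of one array, then for elements of two independently placed arrays.

```c
/* Bank index of a halfword at the given byte address (assumed layout:
   four banks, each 2 bytes wide). */
static unsigned halfword_bank(unsigned long byte_addr)
{
    return (unsigned)((byte_addr / 2u) % 4u);
}

/* Number of halfword-aligned base addresses in [0, limit) for which
   elements k and k+1 of the same array hit the same bank. */
static int same_array_conflicts(unsigned long limit)
{
    int conflicts = 0;
    unsigned long base;
    for (base = 0; base < limit; base += 2)
        if (halfword_bank(base) == halfword_bank(base + 2))
            conflicts++;
    return conflicts;
}

/* Number of (x address, h address) pairs in [0, limit) x [0, limit) for
   which one x load and one h load hit the same bank. */
static int cross_array_conflicts(unsigned long limit)
{
    int conflicts = 0;
    unsigned long xa, ha;
    for (xa = 0; xa < limit; xa += 2)
        for (ha = 0; ha < limit; ha += 2)
            if (halfword_bank(xa) == halfword_bank(ha))
                conflicts++;
    return conflicts;
}
```

Under these assumptions the first count is always zero, whatever the alignment, while the second is not. That is why pairing loads from the same array on one cycle is safe and pairing an x load with an h load is not.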


Example 4-45. Unrolled FIR C Code 


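void fir(short x[], short h[], short y[])
{
   int i, j, sum0, sum1;
   short x0,x1,x2,x3,h0,h1,h2,h3;

   for (j = 0; j < 100; j+=2) {
      sum0 = 0;
      sum1 = 0;
      x0 = x[j];
      for (i = 0; i < 32; i+=4) {
         x1 = x[j+i+1];
         h0 = h[i];
         sum0 += x0 * h0;
         sum1 += x1 * h0;
         x2 = x[j+i+2];
         h1 = h[i+1];
         sum0 += x1 * h1;
         sum1 += x2 * h1;
         x3 = x[j+i+3];
         h2 = h[i+2];
         sum0 += x2 * h2;
         sum1 += x3 * h2;
         x0 = x[j+i+4];
         h3 = h[i+3];
         sum0 += x3 * h3;
         sum1 += x0 * h3;
      }
      y[j] = sum0 >> 15;
      y[j+1] = sum1 >> 15;
   }
}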
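
One way to gain confidence in an unrolled version like this is to compare it against a straightforward FIR on host data. The sketch below is an illustration, not part of the original example: fir_reference, fir_unrolled, and fir_versions_match are hypothetical names, the test data is arbitrary, and the array sizes (100 outputs, 32 taps, 132 inputs) are taken from the loop bounds above. Values are kept small enough that the 32-bit sums cannot overflow.

```c
/* Straightforward FIR: one output per outer iteration. */
static void fir_reference(const short x[], const short h[], short y[])
{
    int i, j;
    for (j = 0; j < 100; j++) {
        int sum = 0;
        for (i = 0; i < 32; i++)
            sum += x[i + j] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* Two outputs per outer pass, four taps per inner pass; each x value is
   loaded once and reused for both sums, as in the unrolled example. */
static void fir_unrolled(const short x[], const short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, x2, x3, h0, h1, h2, h3;
    for (j = 0; j < 100; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i += 4) {
            x1 = x[j+i+1]; h0 = h[i];   sum0 += x0 * h0; sum1 += x1 * h0;
            x2 = x[j+i+2]; h1 = h[i+1]; sum0 += x1 * h1; sum1 += x2 * h1;
            x3 = x[j+i+3]; h2 = h[i+2]; sum0 += x2 * h2; sum1 += x3 * h2;
            x0 = x[j+i+4]; h3 = h[i+3]; sum0 += x3 * h3; sum1 += x0 * h3;
        }
        y[j]   = (short)(sum0 >> 15);
        y[j+1] = (short)(sum1 >> 15);
    }
}

/* Returns 1 when both versions agree on arbitrary (hypothetical) data. */
static int fir_versions_match(void)
{
    short x[132], h[32], y_ref[100], y_unr[100];
    int n;
    for (n = 0; n < 132; n++) x[n] = (short)((n * 977) % 8001 - 4000);
    for (n = 0; n < 32; n++)  h[n] = (short)((n * 613) % 8001 - 4000);
    fir_reference(x, h, y_ref);
    fir_unrolled(x, h, y_unr);
    for (n = 0; n < 100; n++)
        if (y_ref[n] != y_unr[n])
            return 0;
    return 1;
}
```

Note that the unrolled version reads x[j+32] at the end of each inner loop, one element beyond what the reference needs for output j, so the input buffer must be sized accordingly.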
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4.10.3 Unrolled ’C62xx Instructions For the Inner Loop of the FIR 


Example 4—46 shows a list of symbolic FIR instructions. 


Example 4—46. List of Symbolic Unrolled FIR Instructions 


        LDH     *x++,x1             ; x1 = x[j+i+1]
        LDH     *h++,h0             ; h0 = h[i]
        MPY     x0,h0,p00           ; x0 * h0
        MPY     x1,h0,p10           ; x1 * h0
        ADD     p00,sum0,sum0       ; sum0 += x0 * h0
        ADD     p10,sum1,sum1       ; sum1 += x1 * h0

        LDH     *x++,x2             ; x2 = x[j+i+2]
        LDH     *h++,h1             ; h1 = h[i+1]
        MPY     x1,h1,p01           ; x1 * h1
        MPY     x2,h1,p11           ; x2 * h1
        ADD     p01,sum0,sum0       ; sum0 += x1 * h1
        ADD     p11,sum1,sum1       ; sum1 += x2 * h1

        LDH     *x++,x3             ; x3 = x[j+i+3]
        LDH     *h++,h2             ; h2 = h[i+2]
        MPY     x2,h2,p02           ; x2 * h2
        MPY     x3,h2,p12           ; x3 * h2
        ADD     p02,sum0,sum0       ; sum0 += x2 * h2
        ADD     p12,sum1,sum1       ; sum1 += x3 * h2

        LDH     *x++,x0             ; x0 = x[j+i+4]
        LDH     *h++,h3             ; h3 = h[i+3]
        MPY     x3,h3,p03           ; x3 * h3
        MPY     x0,h3,p13           ; x0 * h3
        ADD     p03,sum0,sum0       ; sum0 += x3 * h3
        ADD     p13,sum1,sum1       ; sum1 += x0 * h3

 [cntr] SUB     cntr,1,cntr         ; decrement loop counter
 [cntr] B       LOOP                ; branch to loop
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Figure 4-19 shows the dependency graph of the FIR with no memory hits. 


Figure 4—19. Dependency Graph of FIR With No Memory Hits 


[Dependency graph, A side and B side: the LDH nodes (h0, x1, h1, x2, x3, h2,
x0, h3) feed the MPY nodes (p00, p10, p01, p11, p02, p12, p03, p13) with a
latency of 5; each MPY feeds an ADD into sum0 or sum1 with a latency of 2;
successive ADDs into the same sum are linked with a latency of 1. The SUB
node for the loop counter is also shown.]
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4.10.5 Unrolled Symbolic ’C62xx Instructions With Functional Units for Inner Loop 


Example 4—47 shows the ’C62xx instructions with functional units assigned. 


- Symbolic names now have an A or B in front of them to signify the register
  file where they reside.

- The actual register names are not chosen yet (as in previous examples)
  because they are allocated after scheduling the loop.

- Now there is a pointer to each array in both the A and B register files.
  Because each pointer references either the even elements or the odd
  elements, pointers are incremented by two halfwords instead of one as in
  Example 4-46.


Example 4—47. List of Symbolic Unrolled FIR Instructions 


        LDH     .D1     *Ax++[2],Ax1        ; x1 = x[j+i+1]
        LDH     .D2     *Bh++[2],Ah0        ; h0 = h[i]
        MPY     .M1X    Bx0,Ah0,Ap0a        ; x0 * h0
        MPY     .M1     Ax1,Ah0,Ap1a        ; x1 * h0
        ADD     .L1     Ap0a,Asum0,Asum0    ; sum0 += x0 * h0
        ADD     .L2X    Ap1a,Bsum1,Bsum1    ; sum1 += x1 * h0

        LDH     .D2     *Bx++[2],Bx2        ; x2 = x[j+i+2]
        LDH     .D1     *Ah++[2],Bh1        ; h1 = h[i+1]
        MPY     .M2X    Ax1,Bh1,Bp0b        ; x1 * h1
        MPY     .M2     Bx2,Bh1,Bp1b        ; x2 * h1
        ADD     .L1X    Bp0b,Asum0,Asum0    ; sum0 += x1 * h1
        ADD     .L2     Bp1b,Bsum1,Bsum1    ; sum1 += x2 * h1

        LDH     .D1     *Ax++[2],Ax3        ; x3 = x[j+i+3]
        LDH     .D2     *Bh++[2],Ah2        ; h2 = h[i+2]
        MPY     .M1X    Bx2,Ah2,Ap0c        ; x2 * h2
        MPY     .M1     Ax3,Ah2,Ap1c        ; x3 * h2
        ADD     .L1     Ap0c,Asum0,Asum0    ; sum0 += x2 * h2
        ADD     .L2X    Ap1c,Bsum1,Bsum1    ; sum1 += x3 * h2

        LDH     .D2     *Bx++[2],Bx0        ; x0 = x[j+i+4]
        LDH     .D1     *Ah++[2],Bh3        ; h3 = h[i+3]
        MPY     .M2X    Ax3,Bh3,Bp0d        ; x3 * h3
        MPY     .M2     Bx0,Bh3,Bp1d        ; x0 * h3
        ADD     .L1X    Bp0d,Asum0,Asum0    ; sum0 += x3 * h3
        ADD     .L2     Bp1d,Bsum1,Bsum1    ; sum1 += x0 * h3

 [Bcntr] SUB    .S2     Bcntr,1,Bcntr       ; decrement loop counter
 [Bcntr] B      .S2     LOOP                ; branch to loop
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4.10.6 Register Allocation 


As the number of instructions in a loop increases, assigning a specific register 
to every value in the loop becomes increasingly difficult. If 33 instructions in 
a loop each write a value, they cannot all write to a unique register because 
the ’C62xx has only 32 registers. As a result, registers must share values that 
are not live on the same cycles in the loop. 


For example, in a 4-cycle loop: 


- If a value is written at the end of cycle 0 and read on cycle 2 of the loop,
  it is live for two cycles (cycles 1 and 2 of the loop).

- If another value is written at the end of cycle 2 and read on cycle 0 (the next
  iteration) of the loop, it is also live for two cycles (cycles 3 and 0 of the loop).

Because these two values are never live on the same cycles, they can
occupy the same register. Only after scheduling these instructions and
their children do you know that they can occupy the same register.
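
The lifetime check in this example can be written down as a small C sketch. It is a simplified model, not the assembly optimizer's algorithm: each value is described only by the cycle at whose end it is written and the cycle on which it is last read, and live_mask and can_share_register are hypothetical names introduced for the illustration.

```c
/* Bit mask of the loop cycles (modulo the iteration interval `ii`) on which
   a value is live. The value is written at the end of `write_cycle` and last
   read on `read_cycle`; a read in the next iteration is expressed by adding
   `ii` to the read cycle (for example, cycle 0 of the next iteration of a
   4-cycle loop is cycle 4). */
static unsigned live_mask(int write_cycle, int read_cycle, int ii)
{
    unsigned mask = 0;
    int c;
    for (c = write_cycle + 1; c <= read_cycle; c++)
        mask |= 1u << (c % ii);
    return mask;
}

/* Two values can share one register when they are never live on the same
   loop cycle. */
static int can_share_register(unsigned mask_a, unsigned mask_b)
{
    return (mask_a & mask_b) == 0;
}
```

With the values from the text, live_mask(0, 2, 4) covers cycles 1 and 2 and live_mask(2, 4, 4) covers cycles 3 and 0, so the two values can share a register.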


Register allocation is not complicated but can be tedious when done by hand. 
Each value has to be analyzed for its lifetime and then appropriately combined 
with other values not live on the same cycles in the loop. The assembly opti- 
mizer handles this automatically after it software pipelines the loop. See 
TMS320C6x Optimizing C Compiler User’s Guide for more information. 
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4.10.7 Minimum Iteration Interval With No Memory Hits 


Based on Table 4—17, the minimum iteration interval should be 4. An iteration 
interval of 4 means that two multiply/accumulates still execute per cycle. 


Table 4-17. Resource Table for FIR Code


(a) A side

Unit(s)               Instructions      Total/Unit
.M1                   4 MPYs            4
.S1                                     0
.D1                   4 LDHs            4
.L1, .S1, or .D1      4 ADDs            4
Total non-.M units                      8
1X paths                                4

(b) B side

Unit(s)               Instructions      Total/Unit
.M2                   4 MPYs            4
.S2                   B                 1
.D2                   4 LDHs            4
.L2, .S2, or .D2      4 ADDs and SUB    5
Total non-.M units                      10
2X paths                                4
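
The way a resource table bounds the iteration interval can be sketched as follows. This is a simplified model under stated assumptions: each side has one .M, .S, .D, and .L unit and one cross path, each usable once per cycle, and instructions that can go on any of .L, .S, or .D are spread evenly. The struct and function names are hypothetical, and the bound covers resources only; dependencies can force a larger interval.

```c
struct side_usage {
    int m;        /* instructions that need the .M unit */
    int s;        /* instructions that need the .S unit */
    int d;        /* instructions that need the .D unit */
    int l;        /* instructions that need the .L unit */
    int any_lsd;  /* instructions that can use .L, .S, or .D */
    int xpath;    /* instructions that use the cross path */
};

static int max2(int a, int b) { return a > b ? a : b; }

/* Resource-bound minimum iteration interval for one side. */
static int side_min_ii(struct side_usage u)
{
    int non_m = u.s + u.d + u.l + u.any_lsd;
    int ii = max2(max2(u.m, u.s), max2(u.d, u.l));
    ii = max2(ii, (non_m + 2) / 3);   /* three non-.M units per side */
    ii = max2(ii, u.xpath);
    return ii;
}

/* Values copied from the resource table above. */
static int table_min_ii(void)
{
    struct side_usage a = { 4, 0, 4, 0, 4, 4 };
    struct side_usage b = { 4, 1, 4, 0, 5, 4 };
    return max2(side_min_ii(a), side_min_ii(b));
}
```

For the table above, the .M and .D units and the cross paths each need four cycles, and the ten non-.M instructions on the B side need ceil(10/3) = 4 cycles, so the bound is 4.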


4.10.8 Final Assembly 


Example 4-48 shows the final assembly for the FIR with redundant load
elimination and no memory hits. At the end of the inner loop, there is a branch to
OUTLOOP to execute the next outer loop. The outer loop counter is set to 50
because iterations j and j+1 are executing each time the inner loop is run. The
inner loop counter is set to 8 because iterations i, i+1, i+2, and i+3 are executing
each inner loop iteration.


The cycle count for this nested loop is 50 (8 x 4 + 10 + 6) + 2 = 2402 cycles.
There is a rather large outer-loop overhead for executing the branch to the
outer loop (6 cycles) and the inner loop prologue (10 cycles). Section 4.11
addresses how to reduce this overhead by software pipelining the outer loop.


Code Example                                            Cycles                     Cycle Count
Example 4-43  FIR With Redundant Load Elimination       50 (16 x 2 + 9 + 6) + 2    2352
Example 4-48  FIR With Redundant Load Elimination       50 (8 x 4 + 10 + 6) + 2    2402
              and No Memory Hits
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Example 4—48. FIR With Redundant Load Elimination and No Memory Hits 


OUTLOOP : 


[A2] 


[B2] 


[B2] 


MV. 


MV. 
MV. 


ns pew 


Hew & 


O 


RO 
RO 


ow 


Sl 


.L2X 


50,A2 


62,A3 
64,B10 


*RA++,B5 ; 
A4,4,B1 
B4,2,A8 

8} IBZ 
A2,1,A2 


Byars? [2 I) 7180) 
*A4++[2],A0 
AQ 


B9 

ENB arse [2 Il 7 BS 
seyetrar (LZ |] 7 2Nil 
*A4++[2],A5 
sigilarar |Z |] 7 BS) 
*B4++[2],A7 
INS ara || 2 I) EMS) 
BQ p Ik, Bz 
gsr |[ 2 I) 7180) 
*A4++[2],A0 
*A8++[2],B6 
iByalrap || 2 I) 7 ANdL 
B5,A1,A0 
AO, B6,B6 


*A44++(2],A5 
*B1++[2],B5 


LOOP 
BO,B6,B7 
AQ,A1,Al 
*B4++[2],A7 
*A8++[2],B8 
B2,1,B2 


A0,A9,A9 
A5,B8,B8 
BO,A7,A5 
*B1++[2],B0 
*A4++[2],A0 


v 
’ 


’ 


x0 = 


v 


’ 


1 


r 


’ 


ly = Io [faie2 
Ins} == Io | auarS} 
decrement loop counter 
CD ee SK || Fapalar Z| 
iL = Xe |f Jarabe dl j 
ee Jail = lar|fatsp i] 
Ja) =. Jot [Lab | 
x0 * hO @) 
iL © mil 
we OS = Se | SIarabar S| 


iets iia 
pies pil 


set up outer loop counter 


used to rst x pointer outloop 
used to rst h pointer outloop 


© 


set up pointer to x[j+t2] 

set up pointer to h[1] 

set up inner loop counter 
decrement outer loop counter 


x2 = x[jtit2] @) 
xl = x[jt+it+1] 

zero out sum0 

zero out suml 

nel — Satay [estes 

ho = h[i 


x3 = x[jtit3] 
0 = sei aera | 


| oo mo fe 


* x0 = x[j+it4] 


lgizeiacla © Moca 

AS In IL 

ell Tov) 

IZ, = In false | 

oe jan3} = lov |p aber 3] 

* decrement loop counter 


© 


sum0 += 
SeSh es Tats} 
SEA ee NZ) 
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Example 4—48 FIR With Redundant Load Elimination and No Memory Hits (Continued) 


LOOP:
        ADD     .L2X    A1,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ; sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ;* x0 * h3
||      MPY     .M1     A5,A7,A7        ;* x3 * h2
|| [B2] LDH     .D1     *A8++[2],B6     ;** h1 = h[i+1]
|| [B2] LDH     .D2     *B4++[2],A1     ;** h0 = h[i]

        ADD     .L2     B7,B9,B9        ; sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ; sum0 += x2 * h2
||      MPY     .M1X    B5,A1,A0        ;* x0 * h0
||      MPY     .M2X    A0,B6,B6        ;* x1 * h1
|| [B2] LDH     .D1     *A4++[2],A5     ;** x3 = x[j+i+3]
|| [B2] LDH     .D2     *B1++[2],B5     ;** x0 = x[j+i+4]

        ADD     .L2X    A7,B9,B9        ; sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ; sum0 += x3 * h3
|| [B2] B       .S1     LOOP            ;* branch to loop
||      MPY     .M2     B0,B6,B7        ;* x2 * h1
||      MPY     .M1     A0,A1,A1        ;* x1 * h0
|| [B2] LDH     .D2     *B4++[2],A7     ;** h2 = h[i+2]
|| [B2] LDH     .D1     *A8++[2],B8     ;** h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;** decrement loop counter

        ADD     .L2     B7,B9,B9        ; sum1 += x0 * h3
||      ADD     .L1     A0,A9,A9        ;* sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ;* x3 * h3
||      MPY     .M1X    B0,A7,A5        ;* x2 * h2
|| [B2] LDH     .D2     *B1++[2],B0     ;*** x2 = x[j+i+2]
|| [B2] LDH     .D1     *A4++[2],A0     ;*** x1 = x[j+i+1]
        ; inner loop branch occurs here

 [A2]   B       .S2     OUTLOOP         ; branch to outer loop
||      SUB     .L1     A4,A3,A4        ; reset x pointer to x[j]
||      SUB     .L2     B4,B10,B4       ; reset h pointer to h[0]
||      SUB     .S1     A9,A0,A9        ; sum0 -= x0*h0 (eliminate add)

        SHR     .S1     A9,15,A9        ; sum0 >> 15
||      SHR     .S2     B9,15,B9        ; sum1 >> 15

        STH     .D1     A9,*A6++        ; y[j] = sum0 >> 15

        STH     .D1     B9,*A6++        ; y[j+1] = sum1 >> 15

        NOP     2                       ; branch delay slots
        ; outer loop branch occurs here
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4.11 Software Pipelining the Outer Loop 


In previous examples, software pipelining has always affected the inner loop. 
However, software pipelining works equally well with the outer loop in a nested 
loop. 


4.11.1 FIR C Code 


Example 4-49 shows the unrolled C code for the FIR in Example 4—45, 
Unrolled FIR C Code, on page 4-89. 


Example 4-49. Unrolled FIR C Code 


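void fir(short x[], short h[], short y[])
{
   int i, j, sum0, sum1;
   short x0,x1,x2,x3,h0,h1,h2,h3;

   for (j = 0; j < 100; j+=2) {
      sum0 = 0;
      sum1 = 0;
      x0 = x[j];
      for (i = 0; i < 32; i+=4) {
         x1 = x[j+i+1];
         h0 = h[i];
         sum0 += x0 * h0;
         sum1 += x1 * h0;
         x2 = x[j+i+2];
         h1 = h[i+1];
         sum0 += x1 * h1;
         sum1 += x2 * h1;
         x3 = x[j+i+3];
         h2 = h[i+2];
         sum0 += x2 * h2;
         sum1 += x3 * h2;
         x0 = x[j+i+4];
         h3 = h[i+3];
         sum0 += x3 * h3;
         sum1 += x0 * h3;
      }
      y[j] = sum0 >> 15;
      y[j+1] = sum1 >> 15;
   }
}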
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4.11.2 Making the Outer Loop Parallel With the Inner Loop Epilogue and Prologue 


The final assembly in Example 4-48, FIR With Redundant Load Elimination
and No Memory Hits, contained 16 cycles of overhead to call
the inner loop every time:


- Ten cycles for the loop prologue
- Six cycles for the outer loop instructions and branching to the outer loop


Most of this overhead can be reduced as follows:


- Put the outer loop and branch instructions in parallel with the prologue.
- Create an epilogue to the inner loop.
- Put some outer loop instructions in parallel with the inner-loop epilogue.


4.11.3 Final Assembly 


Example 4-50 shows the final assembly for the FIR with a software-pipelined
outer loop. Below the inner loop, each instruction is marked in the comments
with an e, p, or o for instructions relating to epilogue, prologue, or outer
loop, respectively.


The inner loop is now run only seven times, because the eighth iteration is
done in the epilogue in parallel with the prologue of the next inner loop and the
outer loop instructions.


The improved cycle count for this loop is 2006 cycles: 50 (7 x 4 + 6 + 6) + 6. The
outer-loop overhead for this loop has been reduced from 16 to 8 (6 + 6 - 4);
the -4 represents one iteration less for the inner loop (seven instead
of eight).
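
The cycle counts quoted in this chapter all follow the same pattern, which the short C sketch below makes explicit. The function name and parameter names are hypothetical; the formula itself is just the arithmetic used in the text: outer iterations times (inner iterations times iteration interval, plus prologue, plus outer-loop overhead), plus a fixed number of extra cycles.

```c
/* Cycle count of a nested loop in the form used in the text:
   outer * (inner * ii + prologue + outer_overhead) + extra */
static long nested_loop_cycles(int outer, int inner, int ii,
                               int prologue, int outer_overhead, int extra)
{
    return (long)outer * (inner * ii + prologue + outer_overhead) + extra;
}
```

For example, nested_loop_cycles(50, 7, 4, 6, 6, 6) gives 2006, and nested_loop_cycles(50, 8, 4, 10, 6, 2) gives the 2402 cycles of the version without a software-pipelined outer loop.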


Code Example                                            Cycles                     Cycle Count
Example 4-43  FIR With Redundant Load Elimination       50 (16 x 2 + 9 + 6) + 2    2352
Example 4-48  FIR With Redundant Load Elimination       50 (8 x 4 + 10 + 6) + 2    2402
              and No Memory Hits
Example 4-50  FIR With Redundant Load Elimination       50 (7 x 4 + 6 + 6) + 6     2006
              and No Memory Hits With Outer Loop
              Software Pipelined
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Example 4—50. FIR With Redundant Load Elimination and No Memory Hits With Outer 
Loop Software Pipelined 


        MVK     .S1     50,A2           ; set up outer loop counter
||      STW     .D2     B11,*B15--      ; push register

        MVK     .S1     74,A3           ; used to rst x ptr outer loop
||      MVK     .S2     72,B10          ; used to rst h ptr outer loop
||      ADD     .L2X    A6,2,B11        ; set up pointer to y[1]

        LDH     .D1     *A4++,B8        ; x0 = x[j]
||      ADD     .L2X    A4,4,B1         ; set up pointer to x[j+2]
||      ADD     .L1X    B4,2,A8         ; set up pointer to h[1]
||      MVK     .S2     8,B2            ; set up inner loop counter
|| [A2] SUB     .S1     A2,1,A2         ; decrement outer loop counter

        LDH     .D2     *B1++[2],B0     ; x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ; x1 = x[j+i+1]
||      ZERO    .L1     A9              ; zero out sum0
||      ZERO    .L2     B9              ; zero out sum1

        LDH     .D1     *A8++[2],B6     ; h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ; h0 = h[i]

        LDH     .D1     *A4++[2],A5     ; x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ; x0 = x[j+i+4]

OUTLOOP:
        LDH     .D2     *B4++[2],A7     ; h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ; h3 = h[i+3]
|| [B2] SUB     .S2     B2,2,B2         ; decrement loop counter

        LDH     .D2     *B1++[2],B0     ;* x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;* x1 = x[j+i+1]

        LDH     .D1     *A8++[2],B6     ;* h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;* h0 = h[i]

        MPY     .M1X    B8,A1,A0        ; x0 * h0
||      MPY     .M2X    A0,B6,B6        ; x1 * h1
||      LDH     .D1     *A4++[2],A5     ;* x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;* x0 = x[j+i+4]

 [B2]   B       .S1     LOOP            ; branch to loop
||      MPY     .M2     B0,B6,B7        ; x2 * h1
||      MPY     .M1     A0,A1,A1        ; x1 * h0
||      LDH     .D2     *B4++[2],A7     ;* h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;* h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;* decrement loop counter
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Example 4—50. FIR With Redundant Load Elimination and No Memory Hits With Outer 
Loop Software Pipelined (Continued) 


        ADD     .L1     A0,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ; x3 * h3
||      MPY     .M1X    B0,A7,A5        ; x2 * h2
||      LDH     .D2     *B1++[2],B0     ;** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;** x1 = x[j+i+1]

LOOP:
        ADD     .L2X    A1,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ; sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ;* x0 * h3
||      MPY     .M1     A5,A7,A7        ;* x3 * h2
||      LDH     .D1     *A8++[2],B6     ;** h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;** h0 = h[i]

        ADD     .L2     B7,B9,B9        ; sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ; sum0 += x2 * h2
||      MPY     .M1X    B5,A1,A0        ;* x0 * h0
||      MPY     .M2X    A0,B6,B6        ;* x1 * h1
||      LDH     .D1     *A4++[2],A5     ;** x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;** x0 = x[j+i+4]

        ADD     .L2X    A7,B9,B9        ; sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ; sum0 += x3 * h3
|| [B2] B       .S1     LOOP            ;* branch to loop
||      MPY     .M2     B0,B6,B7        ;* x2 * h1
||      MPY     .M1     A0,A1,A1        ;* x1 * h0
||      LDH     .D2     *B4++[2],A7     ;** h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;** h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;** decrement loop counter

        ADD     .L2     B7,B9,B9        ; sum1 += x0 * h3
||      ADD     .L1     A0,A9,A9        ;* sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ;* x3 * h3
||      MPY     .M1X    B0,A7,A5        ;* x2 * h2
||      LDH     .D2     *B1++[2],B0     ;*** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;*** x1 = x[j+i+1]
        ; inner loop branch occurs here

        ADD     .L2X    A1,B9,B9        ;e sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ;e sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ;e x0 * h3
||      MPY     .M1     A5,A7,A7        ;e x3 * h2
||      SUB     .D1     A4,A3,A4        ;o reset x pointer to x[j]
||      SUB     .D2     B4,B10,B4       ;o reset h pointer to h[0]
|| [A2] B       .S1     OUTLOOP         ;o branch to outer loop
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Example 4—50. FIR With Redundant Load Elimination and No Memory Hits With Outer 


Loop Software Pipelined (Continued) 


        ADD     .D2     B7,B9,B9        ;e sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ;e sum0 += x2 * h2
||      LDH     .D1     *A4++,B8        ;p x0 = x[j]
||      ADD     .L2X    A4,4,B1         ;o set up pointer to x[j+2]
||      ADD     .S1X    B4,2,A8         ;o set up pointer to h[1]
||      MVK     .S2     8,B2            ;o set up inner loop counter

        ADD     .L2X    A7,B9,B9        ;e sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ;e sum0 += x3 * h3
||      LDH     .D2     *B1++[2],B0     ;p x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;p x1 = x[j+i+1]
|| [A2] SUB     .S1     A2,1,A2         ;o decrement outer loop counter

        ADD     .L2     B7,B9,B9        ;e sum1 += x0 * h3
||      SHR     .S1     A9,15,A9        ;e sum0 >> 15
||      LDH     .D1     *A8++[2],B6     ;p h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;p h0 = h[i]

        SHR     .S2     B9,15,B9        ;e sum1 >> 15
||      LDH     .D1     *A4++[2],A5     ;p x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;p x0 = x[j+i+4]

        STH     .D1     A9,*A6++[2]     ;e y[j] = sum0 >> 15
||      STH     .D2     B9,*B11++[2]    ;e y[j+1] = sum1 >> 15
||      ZERO    .S1     A9              ;o zero out sum0
||      ZERO    .S2     B9              ;o zero out sum1
        ; outer loop branch occurs here
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4.12 Outer Loop Conditionally Executed With Inner Loop 


Software pipelining the outer loop improved the outer loop overhead in the 
previous example from 16 cycles to 8 cycles. Executing the outer loop condi- 
tionally and in parallel with the inner loop eliminates the overhead entirely. 


4.12.1 FIR C Code 


Example 4—51 shows the unrolled FIR C code used in the previous example. 


Example 4-51. Unrolled FIR C Code 


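void fir(short x[], short h[], short y[])
{
   int i, j, sum0, sum1;
   short x0,x1,x2,x3,h0,h1,h2,h3;

   for (j = 0; j < 100; j+=2) {
      sum0 = 0;
      sum1 = 0;
      x0 = x[j];
      for (i = 0; i < 32; i+=4) {
         x1 = x[j+i+1];
         h0 = h[i];
         sum0 += x0 * h0;
         sum1 += x1 * h0;
         x2 = x[j+i+2];
         h1 = h[i+1];
         sum0 += x1 * h1;
         sum1 += x2 * h1;
         x3 = x[j+i+3];
         h2 = h[i+2];
         sum0 += x2 * h2;
         sum1 += x3 * h2;
         x0 = x[j+i+4];
         h3 = h[i+3];
         sum0 += x3 * h3;
         sum1 += x0 * h3;
      }
      y[j] = sum0 >> 15;
      y[j+1] = sum1 >> 15;
   }
}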
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4.12.2 ’C62xx Instructions (Inner Loop) 


Example 4-52 shows a list of symbolic FIR instructions. 


Example 4-52. List of Symbolic Unrolled FIR Instructions 


        LDH     *x++,x1             ; x1 = x[j+i+1]
        LDH     *h++,h0             ; h0 = h[i]
        MPY     x0,h0,p00           ; x0 * h0
        MPY     x1,h0,p10           ; x1 * h0
        ADD     p00,sum0,sum0       ; sum0 += x0 * h0
        ADD     p10,sum1,sum1       ; sum1 += x1 * h0

        LDH     *x++,x2             ; x2 = x[j+i+2]
        LDH     *h++,h1             ; h1 = h[i+1]
        MPY     x1,h1,p01           ; x1 * h1
        MPY     x2,h1,p11           ; x2 * h1
        ADD     p01,sum0,sum0       ; sum0 += x1 * h1
        ADD     p11,sum1,sum1       ; sum1 += x2 * h1

        LDH     *x++,x3             ; x3 = x[j+i+3]
        LDH     *h++,h2             ; h2 = h[i+2]
        MPY     x2,h2,p02           ; x2 * h2
        MPY     x3,h2,p12           ; x3 * h2
        ADD     p02,sum0,sum0       ; sum0 += x2 * h2
        ADD     p12,sum1,sum1       ; sum1 += x3 * h2

        LDH     *x++,x0             ; x0 = x[j+i+4]
        LDH     *h++,h3             ; h3 = h[i+3]
        MPY     x3,h3,p03           ; x3 * h3
        MPY     x0,h3,p13           ; x0 * h3
        ADD     p03,sum0,sum0       ; sum0 += x3 * h3
        ADD     p13,sum1,sum1       ; sum1 += x0 * h3

 [cntr] SUB     cntr,1,cntr         ; decrement loop counter
 [cntr] B       LOOP                ; branch to loop
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4.12.3 ’C62xx Instructions (Outer Loop) 


Example 4-53 shows the instructions that execute all of the outer loop
functions. All of these instructions are conditional on inner loop counters. Two
different counters are needed, because they must decrement to 0 on different
iterations.


- The resetting of the x and h pointers is conditional on the pointer reset
  counter, prc.

- The shifting and storing of the even and odd y elements is conditional on
  the store counter, sc.


When these counters are 0, all of the instructions that are conditional on that
value execute.


- The MVK instruction resets the counters to 8 because after every eight
  iterations of the loop, a new inner loop is completed (8 x 4 elements are
  processed).

- The pointer reset counter becomes 0 first to reset the load pointers; then
  the store counter becomes 0 to shift and store the result.
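
The idea of folding the outer loop into the inner loop with two counters can be illustrated in C. This is a host-side sketch, not the scheduled code: fir_flat, fir_nested, and flat_matches_nested are hypothetical names, array indices stand in for the x, h, and y pointers, and the test data is arbitrary. In sequential C both counters reach 0 on the same pass; in the software-pipelined assembly they reach 0 on different iterations, which is why the text needs two of them.

```c
/* Reference: ordinary nested FIR, 100 outputs, 32 taps. */
static void fir_nested(const short x[], const short h[], short y[])
{
    int i, j;
    for (j = 0; j < 100; j++) {
        int sum = 0;
        for (i = 0; i < 32; i++)
            sum += x[j + i] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* Single flat loop: the outer-loop work runs conditionally on counters. */
static void fir_flat(const short x[], const short h[], short y[])
{
    int sum0 = 0, sum1 = 0;
    int prc = 8, sc = 8;          /* pointer reset and store counters */
    int xi = 0, hi = 0, yi = 0;   /* indices standing in for pointers */
    int cntr, k;

    for (cntr = 50 * 8; cntr > 0; cntr--) {
        for (k = 0; k < 4; k++) {             /* four taps per pass */
            sum0 += x[xi + k] * h[hi + k];
            sum1 += x[xi + k + 1] * h[hi + k];
        }
        xi += 4;
        hi += 4;
        if (--prc == 0) {                     /* reset the load pointers */
            xi -= 30;                         /* back to x[j+2] */
            hi -= 32;                         /* back to h[0] */
            prc = 8;
        }
        if (--sc == 0) {                      /* shift and store results */
            y[yi]     = (short)(sum0 >> 15);
            y[yi + 1] = (short)(sum1 >> 15);
            yi += 2;
            sum0 = 0;
            sum1 = 0;
            sc = 8;
        }
    }
}

/* Returns 1 when the flat loop reproduces the nested result. */
static int flat_matches_nested(void)
{
    short x[132], h[32], y_a[100], y_b[100];
    int n;
    for (n = 0; n < 132; n++) x[n] = (short)((n * 977) % 8001 - 4000);
    for (n = 0; n < 32; n++)  h[n] = (short)((n * 613) % 8001 - 4000);
    fir_nested(x, h, y_a);
    fir_flat(x, h, y_b);
    for (n = 0; n < 100; n++)
        if (y_a[n] != y_b[n])
            return 0;
    return 1;
}
```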


Example 4—53. List of Symbolic FIR Outer Loop Instructions 


 [sc]   SUB     sc,1,sc             ; dec store lp cntr
 [!sc]  SHR     sum0,15,y0          ; (sum0 >> 15)
 [!sc]  SHR     sum1,15,y1          ; (sum1 >> 15)
 [!sc]  STH     y0,*y++[2]          ; y[j] = (sum0 >> 15)
 [!sc]  STH     y1,*y++[2]          ; y[j+1] = (sum1 >> 15)
 [!sc]  MVK     8,sc                ; reset store lp cntr

 [prc]  SUB     prc,1,prc           ; dec pointer reset lp cntr
 [!prc] SUB     x,rstx,x            ; reset x ptr (A side)
 [!prc] SUB     x,rstx,x            ; reset x ptr (B side)
 [!prc] SUB     h,rsth,h            ; reset h ptr (A side)
 [!prc] SUB     h,rsth,h            ; reset h ptr (B side)
 [!prc] MVK     8,prc               ; reset pointer reset lp cntr
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The total number of instructions to execute both the inner and outer loops is 
38: 26 for the inner loop and 12 for the outer loop. A 4-cycle loop is no longer 
possible. To avoid slowing down the throughput of the inner loop to reduce the 
outer-loop overhead, you must unroll the FIR again. 
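
The reason a 4-cycle loop no longer works is a simple slot count, sketched below. The sketch assumes only what Chapter 1 states, that the CPU can dispatch eight instructions per cycle; fits_issue_slots and min_ii_for_slots are hypothetical names. Having enough slots is a necessary condition only: unit types, cross paths, and dependencies can still force a longer loop.

```c
enum { UNITS_PER_CYCLE = 8 };   /* eight functional units */

/* Nonzero when `instructions` fit into a loop of `ii` cycles by slot count. */
static int fits_issue_slots(int instructions, int ii)
{
    return instructions <= UNITS_PER_CYCLE * ii;
}

/* Smallest iteration interval that has enough issue slots. */
static int min_ii_for_slots(int instructions)
{
    return (instructions + UNITS_PER_CYCLE - 1) / UNITS_PER_CYCLE;
}
```

With 26 inner-loop instructions, a 4-cycle loop (32 slots) is enough; with all 38 instructions it is not.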


Example 4-54 shows the C code for the FIR, which operates on eight 
elements every inner loop. Two outer loops are also being processed together, 
as in Example 4-51. 




Example 4-54. Unrolled FIR C Code 


void fir(short x[], short h[], short y[])
{
   int i, j, sum0, sum1;
   short x0,x1,x2,x3,x4,x5,x6,x7,h0,h1,h2,h3,h4,h5,h6,h7;

   for (j = 0; j < 100; j+=2) {
      sum0 = 0;
      sum1 = 0;
      x0 = x[j];
      for (i = 0; i < 32; i+=8) {
         x1 = x[j+i+1];
         h0 = h[i];
         sum0 += x0 * h0;
         sum1 += x1 * h0;
         x2 = x[j+i+2];
         h1 = h[i+1];
         sum0 += x1 * h1;
         sum1 += x2 * h1;
         x3 = x[j+i+3];
         h2 = h[i+2];
         sum0 += x2 * h2;
         sum1 += x3 * h2;
         x4 = x[j+i+4];
         h3 = h[i+3];
         sum0 += x3 * h3;
         sum1 += x4 * h3;
         x5 = x[j+i+5];
         h4 = h[i+4];
         sum0 += x4 * h4;
         sum1 += x5 * h4;
         x6 = x[j+i+6];
         h5 = h[i+5];
         sum0 += x5 * h5;
         sum1 += x6 * h5;
         x7 = x[j+i+7];
         h6 = h[i+6];
         sum0 += x6 * h6;
         sum1 += x7 * h6;
         x0 = x[j+i+8];
         h7 = h[i+7];
         sum0 += x7 * h7;
         sum1 += x0 * h7;
      }
      y[j] = sum0 >> 15;
      y[j+1] = sum1 >> 15;
   }
}
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4.12.5 ’C62xx Instructions (Inner Loop) 
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Example 4-55 shows the symbolic instructions that perform the inner and 
outer loops of the FIR: 


- LDWs are used instead of LDHs to reduce the number of loads in the loop.

- The reset pointer instructions immediately follow the LDW instructions.

- The first ADD instructions for sum0 and sum1 are conditional on the same
  value as the store counter, because when sc is 0, the end of one inner loop
  has been reached and the first ADD, which adds the previous sum07 to
  p00, must not be executed.

- The first ADD for sum0 writes to the same register as the first MPY p00.
  The second ADD reads p00 and p01. At the beginning of each inner loop,
  the first ADD is not performed, so the second ADD correctly reads the
  results of the first two MPYs (p01 and p00) and adds them together. For
  other iterations of the inner loop, the first ADD executes, and the second
  ADD sums the second MPY result (p01) with the running accumulator.
  The same is true for the first and second ADDs of sum1.
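
Because each LDW brings in two halfwords, the multiplies must select the correct half of each operand. The C sketch below models that selection for the four multiply variants used in this section. It is an assumption-laden illustration: it assumes the element with the lower index sits in the low half of the loaded word (consistent with the comments in the listings), and pack16, mpy, mpyh, mpylh, and mpyhl are hypothetical C helpers named after the instructions, not TI intrinsics.

```c
#include <stdint.h>

/* Low and high signed halfwords of a 32-bit word. The narrowing casts rely
   on the usual two's-complement wraparound. */
static int16_t lo16(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
static int16_t hi16(uint32_t w) { return (int16_t)(w >> 16); }

/* Pack two halfwords the way one word load is assumed to deliver them. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

static int32_t mpy(uint32_t a, uint32_t b)   { return (int32_t)lo16(a) * lo16(b); }
static int32_t mpyh(uint32_t a, uint32_t b)  { return (int32_t)hi16(a) * hi16(b); }
static int32_t mpylh(uint32_t a, uint32_t b) { return (int32_t)lo16(a) * hi16(b); }
static int32_t mpyhl(uint32_t a, uint32_t b) { return (int32_t)hi16(a) * lo16(b); }
```

With h01 holding h[i] and h[i+1], and x01 holding x[j+i] and x[j+i+1], mpy(h01, x01) and mpyh(h01, x01) give the first two sum0 products, while mpylh(h01, x01) gives h[i] * x[j+i+1], the first sum1 product. The sum1 product h[i+1] * x[j+i+2] needs the next x word, so it is formed as mpyhl(h01, x23).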
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Example 4—55. List of Symbolic FIR Instructions 


        LDW     *h++[2],h01         ; h[i+0] & h[i+1]
        LDW     *h++[2],h23         ; h[i+2] & h[i+3]
        LDW     *h++[2],h45         ; h[i+4] & h[i+5]
        LDW     *h++[2],h67         ; h[i+6] & h[i+7]
        LDW     *x++[2],x01         ; x[j+i+0] & x[j+i+1]
        LDW     *x++[2],x23         ; x[j+i+2] & x[j+i+3]
        LDW     *x++[2],x45         ; x[j+i+4] & x[j+i+5]
        LDW     *x++[2],x67         ; x[j+i+6] & x[j+i+7]
        LDH     *x,x8               ; x[j+i+8]

 [prc]  SUB     prc,1,prc           ; dec pointer reset lp cntr
 [!prc] SUB     x,rstx,x            ; reset x ptr (A side)
 [!prc] SUB     x,rstx,x            ; reset x ptr (B side)
 [!prc] SUB     h,rsth,h            ; reset h ptr (A side)
 [!prc] SUB     h,rsth,h            ; reset h ptr (B side)
 [!prc] MVK     8,prc               ; reset pointer reset lp cntr

 [sc]   SUB     sc,1,sc             ; dec store lp cntr
 [!sc]  SHR     sum0,15,y0          ; (sum0 >> 15)
 [!sc]  SHR     sum1,15,y1          ; (sum1 >> 15)
 [!sc]  STH     y0,*y++[2]          ; y[j] = (sum0 >> 15)
 [!sc]  STH     y1,*y++[2]          ; y[j+1] = (sum1 >> 15)

        MPY     h01,x01,p00         ; p00 = h[i+0]*x[j+i+0]
 [sc]   ADD     p00,sum07,p00       ; sum0(p00) = p00 + sum0
        MPYH    h01,x01,p01         ; p01 = h[i+1]*x[j+i+1]
        ADD     p01,p00,sum01       ; sum0 += p01
        MPY     h23,x23,p02         ; p02 = h[i+2]*x[j+i+2]
        ADD     p02,sum01,sum02     ; sum0 += p02
        MPYH    h23,x23,p03         ; p03 = h[i+3]*x[j+i+3]
        ADD     p03,sum02,sum03     ; sum0 += p03
        MPY     h45,x45,p04         ; p04 = h[i+4]*x[j+i+4]
        ADD     p04,sum03,sum04     ; sum0 += p04
        MPYH    h45,x45,p05         ; p05 = h[i+5]*x[j+i+5]
        ADD     p05,sum04,sum05     ; sum0 += p05
        MPY     h67,x67,p06         ; p06 = h[i+6]*x[j+i+6]
        ADD     p06,sum05,sum06     ; sum0 += p06
        MPYH    h67,x67,p07         ; p07 = h[i+7]*x[j+i+7]
        ADD     p07,sum06,sum07     ; sum0 += p07

        MPYLH   h01,x01,p10         ; p10 = h[i+0]*x[j+i+1]
 [sc]   ADD     p10,sum17,p10       ; sum1(p10) = p10 + sum1
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Example 4—55. List of Symbolic FIR Instructions (Continued) 


        MPYHL   h01,x23,p11         ; p11 = h[i+1]*x[j+i+2]
        ADD     p11,p10,sum11       ; sum1 += p11
        MPYLH   h23,x23,p12         ; p12 = h[i+2]*x[j+i+3]
        ADD     p12,sum11,sum12     ; sum1 += p12
        MPYHL   h23,x45,p13         ; p13 = h[i+3]*x[j+i+4]
        ADD     p13,sum12,sum13     ; sum1 += p13
        MPYLH   h45,x45,p14         ; p14 = h[i+4]*x[j+i+5]
        ADD     p14,sum13,sum14     ; sum1 += p14
        MPYHL   h45,x67,p15         ; p15 = h[i+5]*x[j+i+6]
        ADD     p15,sum14,sum15     ; sum1 += p15
        MPYLH   h67,x67,p16         ; p16 = h[i+6]*x[j+i+7]
        ADD     p16,sum15,sum16     ; sum1 += p16
        MPYHL   h67,x8,p17          ; p17 = h[i+7]*x[j+i+8]
        ADD     p17,sum16,sum17     ; sum1 += p17

 [!sc]  MVK     8,sc                ; reset store lp cntr
 [cntr] SUB     cntr,1,cntr         ; decrement loop counter
 [cntr] B       LOOP                ; branch to loop


4.12.6 ’C62xx Instructions (Inner Loop and Outer Loop) 
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Example 4—56 shows the FIR instructions with functional units allocated. Al- 
though this allocation is one of many possibilities, one goal is to keep the 1X 
and 2X paths to a minimum. Even with this goal, you have five 2X paths and 
seven 1X paths. 


One requirement that was assumed when the functional units were chosen
was that all the sum0 values reside on the same side (A in this case) and all
the sum1 values reside on the other side (B). Since you are scheduling eight
accumulates for both sum0 and sum1 in an 8-cycle loop, each ADD must be
scheduled immediately following the previous ADD. Therefore, it is
undesirable for any sum0 ADDs to use the same functional units as sum1 ADDs.


One MV instruction was added to get x01 on the B side for the MPYLH p10
instruction.




Example 4—56. List of Actual FIR Instructions 


        LDW     .D1     *Ah++[2],Ah01           ; h[i+0] & h[i+1]
        LDW     .D2     *Bh++[2],Bh23           ; h[i+2] & h[i+3]
        LDW     .D1     *Ah++[2],Ah45           ; h[i+4] & h[i+5]
        LDW     .D2     *Bh++[2],Bh67           ; h[i+6] & h[i+7]
        LDW     .D2     *Bx++[2],Ax01           ; x[j+i+0] & x[j+i+1]
        LDW     .D1     *Ax++[2],Bx23           ; x[j+i+2] & x[j+i+3]
        LDW     .D2     *Bx++[2],Ax45           ; x[j+i+4] & x[j+i+5]
        LDW     .D1     *Ax++[2],Bx67           ; x[j+i+6] & x[j+i+7]
        LDH     .D2     *Bx,Ax8                 ; x[j+i+8]

 [A2]   SUB     .S1     A2,1,A2                 ; dec store lp cntr
 [!A2]  SHR     .S1     Asum07,15,Ay0           ; (Asum0 >> 15)
 [!A2]  SHR     .S2     Bsum17,15,By1           ; (Bsum1 >> 15)
 [!A2]  STH     .D1     Ay0,*Ay++[2]            ; y[j] = (Asum0 >> 15)
 [!A2]  STH     .D2     By1,*By++[2]            ; y[j+1] = (Bsum1 >> 15)

        MPY     .M1     Ah01,Ax01,Ap00          ; p00 = h[i+0]*x[j+i+0]
 [A2]   ADD     .L1     Ap00,Asum07,Ap00        ; sum0(p00) = p00 + sum0
        MPYH    .M1     Ah01,Ax01,Ap01          ; p01 = h[i+1]*x[j+i+1]
        ADD     .L1     Ap01,Ap00,Asum01        ; sum0 += p01
        MPY     .M2     Bh23,Bx23,Bp02          ; p02 = h[i+2]*x[j+i+2]
        ADD     .L1X    Bp02,Asum01,Asum02      ; sum0 += p02
        MPYH    .M2     Bh23,Bx23,Bp03          ; p03 = h[i+3]*x[j+i+3]
        ADD     .L1X    Bp03,Asum02,Asum03      ; sum0 += p03
        MPY     .M1     Ah45,Ax45,Ap04          ; p04 = h[i+4]*x[j+i+4]
        ADD     .L1     Ap04,Asum03,Asum04      ; sum0 += p04
        MPYH    .M1     Ah45,Ax45,Ap05          ; p05 = h[i+5]*x[j+i+5]
        ADD     .L1     Ap05,Asum04,Asum05      ; sum0 += p05
        MPY     .M2     Bh67,Bx67,Bp06          ; p06 = h[i+6]*x[j+i+6]
        ADD     .L1X    Bp06,Asum05,Asum06      ; sum0 += p06
        MPYH    .M2     Bh67,Bx67,Bp07          ; p07 = h[i+7]*x[j+i+7]
        ADD     .L1X    Bp07,Asum06,Asum07      ; sum0 += p07

        MV      .S2X    Ax01,Bx01               ; move to other reg file
        MPYLH   .M2X    Ah01,Bx01,Bp10          ; p10 = h[i+0]*x[j+i+1]
 [A2]   ADD     .L2     Bp10,Bsum17,Bp10        ; sum1(p10) = p10 + sum1
        MPYHL   .M1X    Ah01,Bx23,Ap11          ; p11 = h[i+1]*x[j+i+2]
        ADD     .L2X    Ap11,Bp10,Bsum11        ; sum1 += p11
        MPYLH   .M2     Bh23,Bx23,Bp12          ; p12 = h[i+2]*x[j+i+3]
        ADD     .L2     Bp12,Bsum11,Bsum12      ; sum1 += p12
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Example 4—56. List of Actual FIR Instructions (Continued) 


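        MPYHL   .M1X    Bh23,Ax45,Ap13          ; p13 = h[i+3]*x[j+i+4]
        ADD     .L2X    Ap13,Bsum12,Bsum13      ; sum1 += p13
        MPYLH   .M1     Ah45,Ax45,Ap14          ; p14 = h[i+4]*x[j+i+5]
        ADD     .L2X    Ap14,Bsum13,Bsum14      ; sum1 += p14
        MPYHL   .M2X    Ah45,Bx67,Bp15          ; p15 = h[i+5]*x[j+i+6]
        ADD     .L2     Bp15,Bsum14,Bsum15      ; sum1 += p15
        MPYLH   .M2     Bh67,Bx67,Bp16          ; p16 = h[i+6]*x[j+i+7]
        ADD     .L2     Bp16,Bsum15,Bsum16      ; sum1 += p16
        MPYHL   .M1X    Bh67,Ax8,Ap17           ; p17 = h[i+7]*x[j+i+8]
        ADD     .L2X    Ap17,Bsum16,Bsum17      ; sum1 += p17

 [!A2]  MVK     .S1     4,A2                    ; reset store lp cntr
 [A1]   SUB     .S1     A1,1,A1                 ; dec ptr reset lp cntr
 [!A1]  SUB     .S2     Bx,Brstx,Bx             ; reset x ptr
 [!A1]  SUB     .S1     Ax,Arstx,Ax             ; reset x ptr
 [!A1]  SUB     .S1     Ah,Arsth,Ah             ; reset h ptr
 [!A1]  SUB     .S2     Bh,Brsth,Bh             ; reset h ptr
 [!A1]  MVK     .S1     4,A1                    ; reset ptr rst lp cntr

 [B0]   SUB     .S2     B0,1,B0                 ; dec outer lp cntr
 [B0]   B       .S2     LOOP                    ; Branch outer loop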
4.12.7 Minimum Iteration Interval


Based on Table 4-18, the minimum iteration interval is 8. Because the loop
body contains 16 multiplies, an iteration interval of 8 means that two
multiply-accumulates per cycle are still executing.


Table 4-18. Resource Table for FIR


(a) A side                            (b) B side

Unit(s)              Total/Unit       Unit(s)              Total/Unit
.M1                  8                .M2                  8
.S1                  7                .S2                  6
.D1                  5                .D2                  6
.L1                  8                .L2                  8
Total non-.M units   20               Total non-.M units   20
1X paths             7                2X paths             5
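

The way a resource table bounds the iteration interval can be sketched in C.
This is a minimal illustration, not the tool's algorithm: it assumes that each
functional unit and each cross path accepts one instruction per cycle, and
that the non-.M instructions on a side share that side's three units (.S, .D,
and .L). The function names are hypothetical.

```c
/* Ceiling division for non-negative integers. */
int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Resource-bound minimum iteration interval for one side of the CPU.
 * m, s, d, l : instructions bound to the .M, .S, .D, and .L units
 * non_m      : total instructions that must issue on .S, .D, or .L
 * xpath      : number of cross-path uses
 * Assumption: every unit and the cross path take one instruction per cycle. */
int side_min_ii(int m, int s, int d, int l, int non_m, int xpath)
{
    int ii = m;
    int shared = ceil_div(non_m, 3);   /* three non-.M units per side */
    if (s > ii) ii = s;
    if (d > ii) ii = d;
    if (l > ii) ii = l;
    if (xpath > ii) ii = xpath;
    if (shared > ii) ii = shared;
    return ii;
}

/* Apply the bound to both columns of Table 4-18. */
int fir_min_ii(void)
{
    int a = side_min_ii(8, 7, 5, 8, 20, 7);   /* A side */
    int b = side_min_ii(8, 6, 6, 8, 20, 5);   /* B side */
    return a > b ? a : b;
}
```

Under these assumptions fir_min_ii() returns 8: the .M and .L units, each
used eight times per side, are the limiting resources.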


4.12.8 Final Assembly 


Example 4-57 shows the final assembly for the FIR with the outer loop
conditionally executing in parallel with the inner loop. The cycle count of
this code is 1612: 50 (8 x 4 + 0) + 12. The overhead due to the outer loop
has been completely eliminated.


Code Example                                              Cycles                    Cycle Count

Example 4-40  FIR With Redundant Load Elimination         50 (16 x 2 + 9 + 6) + 2   2352

Example 4-48  FIR With Redundant Load Elimination and     50 (8 x 4 + 10 + 6) + 2   2402
              No Memory Hits

Example 4-50  FIR With Redundant Load Elimination and     50 (7 x 4 + 6 + 6) + 6    2006
              No Memory Hits With Outer Loop Software
              Pipelined

Example 4-53  FIR With Redundant Load Elimination and     50 (8 x 4 + 0) + 12       1612
              No Memory Hits With Outer Loop
              Conditionally Executed With Inner Loop
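

The cycle expressions in the comparison above all have the same shape, so the
totals can be checked with a few lines of C. The sketch below is only an
arithmetic check; the parameter names are illustrative, and the meaning
attached to each term (which factor is the iteration interval, which is the
iteration count) is an assumption rather than something the table states.

```c
/* Evaluates outer * (a * b + overhead) + extra, the common form of the
 * cycle expressions in the comparison table.
 * outer    : number of outer-loop passes (50 in every row)
 * a, b     : the two factors of the inner-loop term, for example 8 x 4
 * overhead : cycles added per outer-loop pass, for example 10 + 6
 * extra    : cycles added once, outside the outer loop */
int fir_cycles(int outer, int a, int b, int overhead, int extra)
{
    return outer * (a * b + overhead) + extra;
}
```

For example, fir_cycles(50, 8, 4, 0, 12) evaluates to 1612, the count quoted
for the final assembly.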


Example 4-57. Final Assembly for FIR 


[Al] 


[!Al1] 


[A2] 


[!A2] 


[A2] 


B4,A0 
B4,4,B2 
A4,Bl 
A4,4,A4 
200,B0O 


*A4++[2],B9 
*B1++[2],A10 
4,Al 


*B2++[2],B7 
*AO++[2],A8 
60,A3 
60,B14 


*B1++[2],A11 
*A4++[2],B10 
Al,1,Al 
64,A5 

64,B5 
A6o,2,B6 


*A0++[2],A9 
*B2++[2],B8 
A4,A3,A4 


B1,B14,Bl 
AO, A5,A0 
*B1,A8 


Al10,0,B8 
5,A2 


A8,B8,B4 
B2,B5,B2 
A8,B9,A14 


A8,A10,A7 
B7,B9,B13 
A2,1,A2 
Bll 


B11,15,Bl11 


Y 
, 


, 


7* x[Jjtit2] 
7* x[j+ito] 


, 


point to h[0] & h[1] 

point to h[2] & h[3] 

point to x[j] & x[j+1] 
point to x[j+2] & x[j+3] 
set lp ctr ((32/8)*(100/2)) 


j+it3] 
j+itd] 


set pointer reset lp cntr 


h[it2 
h[i+0 


& h[id 
& hfit 


used to reset x ptr 
used to reset x ptr 


x[j+it4] 
x[j+it+6] 


+3 
+1 


& x[j+it5] 
& x[j+it7] 


(16*4-4) 
(16*4-4) 


dec pointer reset lp cntr 


used to reset h ptr 
used to reset h ptr 


point to y[jt1] 
h[it4] & h[it5] 
h[it6] & h[it7] 


reset x pre 


reset. xX ptr 
reset h ptr 
x[Jj+it+8] 


(16*4) 
(16*4) 


move to other reg file 


set store lp 


pl0 = h[it0]*x[jt+i4 


reset h ptr 


pll = h[itl]*x[j+i4 


p00 = h[it0]*x[j+i4 
p12 = h[it2]*x[j+i4 


dec store lp 


entre 


cntr 


zero out initial accumulator 


(Bsuml >> 15) 


p02 = h[it2]*x[j+i4 
p0l = h[itl]*x[j+i4 


suml (p10) = p10 + suml 


& x[j+it3] 
& x[jti 


zero out initial accumulator 
LOOP:
   [!A2]  SHR    .S1   A10,15,A12      ; (Asum0 >> 15)
   [B0]   SUB    .S2   B0,1,B0         ; dec outer lp cntr
          MPYH   .M2   B7,B9,B13       ; p03 = h[i+3]*x[j+i+3]
   [A2]   ADD    .L1   A7,A10,A7       ; sum0(p00) = p00 + sum0
          MPYHL  .M1X  B7,A11,A10      ; p13 = h[i+3]*x[j+i+4]
          ADD    .L2X  A14,B4,B7       ; sum1 += p11
          LDW    .D2   *B2++[2],B7     ;* h[i+2] & h[i+3]
          LDW    .D1   *A0++[2],A8     ;* h[i+0] & h[i+1]
          ADD    .L1   A10,A7,A13      ; sum0 += p01
          MPYHL  .M2X  A9,B10,B12      ; p15 = h[i+5]*x[j+i+6]
          MPYLH  .M1   A9,A11,A10      ; p14 = h[i+4]*x[j+i+5]
          ADD    .L2   B13,B7,B7       ; sum1 += p12
          LDW    .D2   *B1++[2],A11    ;* x[j+i+4] & x[j+i+5]
          LDW    .D1   *A4++[2],B10    ;* x[j+i+6] & x[j+i+7]
   [A1]   SUB    .S1   A1,1,A1         ;* dec pointer reset lp cntr
   [B0]   B      .S2   LOOP            ; Branch outer loop
          MPY    .M1   A9,A11,A11      ; p04 = h[i+4]*x[j+i+4]
          ADD    .L1X  B9,A13,A13      ; sum0 += p02
          MPYLH  .M2   B8,B10,B13      ; p16 = h[i+6]*x[j+i+7]
          ADD    .L2X  A10,B7,B7       ; sum1 += p13
          LDW    .D1   *A0++[2],A9     ;* h[i+4] & h[i+5]
          LDW    .D2   *B2++[2],B8     ;* h[i+6] & h[i+7]
   [!A1]  SUB    .S1   A4,A3,A4        ;* reset x ptr
          MPY    .M2   B8,B10,B11      ; p06 = h[i+6]*x[j+i+6]
          MPYH   .M1   A9,A11,A11      ; p05 = h[i+5]*x[j+i+5]
          ADD    .L1X  B13,A13,A9      ; sum0 += p03
          ADD    .L2X  A10,B7,B7       ; sum1 += p14
   [!A1]  SUB    .S2   B1,B14,B1       ;* reset x ptr
   [!A1]  SUB    .S1   A0,A5,A0        ;* reset h ptr
          LDH    .D2   *B1,A8          ;* x[j+i+8]
   [!A2]  MVK    .S1   4,A2            ; reset store lp cntr
          MPYH   .M2   B8,B10,B13      ; p07 = h[i+7]*x[j+i+7]
          ADD    .L1   A11,A9,A9       ; sum0 += p04
          MPYHL  .M1X  B8,A8,A9        ; p17 = h[i+7]*x[j+i+8]
          ADD    .S2   B12,B7,B10      ; sum1 += p15
   [!A2]  STH    .D2   B11,*B6++[2]    ; y[j+1] = (Bsum1 >> 15)
   [!A2]  STH    .D1   A12,*A6++[2]    ; y[j] = (Asum0 >> 15)
          ADD    .L2X  A10,0,B8        ;* move to other reg file
          ADD    .L1   A11,A9,A12      ; sum0 += p05
          ADD    .L2   B13,B10,B8      ; sum1 += p16
          MPYLH  .M2X  A8,B8,B4        ;* p10 = h[i+0]*x[j+i+1]
   [!A1]  MVK    .S1   4,A1            ;* reset pointer reset lp cntr
   [!A1]  SUB    .S2   B2,B5,B2        ;* reset h ptr
          MPYHL  .M1X  A8,B9,A14       ;* p11 = h[i+1]*x[j+i+2]
          ADD    .L2X  A9,B8,B11       ; sum1 += p17
          ADD    .L1X  B11,A12,A12     ; sum0 += p06
          MPY    .M1   A8,A10,A7       ;* p00 = h[i+0]*x[j+i+0]
          MPYLH  .M2   B7,B9,B13       ;* p12 = h[i+2]*x[j+i+3]
   [A2]   SUB    .S1   A2,1,A2         ;* dec store lp cntr
          ADD    .L1X  B13,A12,A10     ; sum0 += p07
   [!A2]  SHR    .S2   B11,15,B11      ;* (Bsum1 >> 15)
          MPY    .M2   B7,B9,B9        ;* p02 = h[i+2]*x[j+i+2]
          MPYH   .M1   A8,A10,A10      ;* p01 = h[i+1]*x[j+i+1]
   [A2]   ADD    .L2   B4,B11,B4       ;* sum1(p10) = p10 + sum1
          LDW    .D1   *A4++[2],B9     ;** x[j+i+2] & x[j+i+3]
          LDW    .D2   *B1++[2],A10    ;** x[j+i+0] & x[j+i+1]
          ; Branch occurs here
   [!A2]  SHR    .S1   A10,15,A12      ; (Asum0 >> 15)
   [!A2]  STH    .D2   B11,*B6++[2]    ; y[j+1] = (Bsum1 >> 15)
|| [!A2]  STH    .D1   A12,*A6++[2]    ; y[j] = (Asum0 >> 15)
Chapter 5


Applications Programming 


This chapter provides extensive code examples to supplement those found in 
Chapters 2-4. 
5.1 Summary of Major Programming Methods


The key to implementing applications on the ’C62xx is to take advantage of the
processor’s full speed. The main technique for achieving this goal involves
unrolling software loops to reach the limits of the functional units while
meeting the data dependency constraints.


In addition to loop unrolling, the following methods are helpful for improving 
performance: 


- Rearranging the C code

  If you are implementing a system based on existing C code, rearranging
  the tasks in the C code is a useful way to gain better performance.


- Avoiding memory bank hits

  Memory bank hits, especially those in the inner loop of a nested loop,
  degrade performance dramatically and must be avoided. Most memory bank
  hits can be eliminated by allocating the relevant arrays properly. Some
  situations, such as accessing a word and a halfword in the same cycle,
  can also cause a memory bank hit and should be avoided.


If the system implementation is quite complicated, program-memory size
becomes an issue. To achieve a good balance between program-memory size
and speed, you can implement the less critical portions with highly compact
assembly code that sacrifices some performance.
5.2 Implementation of GSM EFR Vocoder


This section presents the implementation of some representative pieces of the
enhanced full rate (EFR) vocoder for the Global System for Mobile
Communications (GSM).


Note: Copyright for C Code

The European Telecommunications Standards Institute (ETSI) has the copyright
to all the C code used in this section.


The following are the global constants/symbols defined in EFR:


#define Word16  short
#define Word32  int
#define MAX_32  0x7fffffffL
#define MIN_32  0x80000000L
#define MAX_16  0x7fff
#define MIN_16  0x8000


5.2.1 Implementation of the Multiply-Accumulate Loop


First, examine the most popular loop used in almost every fixed-point vocoder,
the multiply-accumulate (MAC) loop, shown in Example 5-1.


Example 5-1. Typical MAC Loop C Code


input:
        Word16 N;      (typical value of N is an even integer,
                        greater than or equal to 20)
        Word16 *x, *y;
result:
        Word32 sum;
C Code
        for (i=0; i<N; i++) sum = L_mac(sum, x[i], y[i]);
where
        L_mac(a,b,c) = _sadd(a, _smpy(b,c))
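

For readers who want to try the MAC loop on a host machine, the following is
a minimal portable-C sketch of the saturating arithmetic behind L_mac. It
models the _sadd and _smpy intrinsics rather than calling them; the function
names (sadd_model, smpy_model, L_mac_model, mac_loop) are hypothetical and
are not part of the EFR source or the compiler.

```c
#include <stdint.h>

#define SAT_MAX_32 ((int32_t)0x7fffffffL)
#define SAT_MIN_32 ((int32_t)(-2147483647L - 1))

/* Saturating 32-bit add, modelling _sadd. */
int32_t sadd_model(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a + (int64_t)b;
    if (r > SAT_MAX_32) return SAT_MAX_32;
    if (r < SAT_MIN_32) return SAT_MIN_32;
    return (int32_t)r;
}

/* Saturating fractional multiply, modelling _smpy: (a * b) << 1.
 * The only overflow case is 0x8000 * 0x8000, which saturates. */
int32_t smpy_model(int16_t a, int16_t b)
{
    int64_t r = 2 * (int64_t)a * (int64_t)b;
    return r > SAT_MAX_32 ? SAT_MAX_32 : (int32_t)r;
}

/* L_mac(a,b,c) = _sadd(a, _smpy(b,c)) */
int32_t L_mac_model(int32_t acc, int16_t b, int16_t c)
{
    return sadd_model(acc, smpy_model(b, c));
}

/* The MAC loop of Example 5-1. */
int32_t mac_loop(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum = L_mac_model(sum, x[i], y[i]);
    return sum;
}
```

The 64-bit intermediate keeps the overflow checks simple on a host compiler;
on the ’C62xx the same work is done by the SMPY and SADD instructions.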
Example 5-2 shows a list of symbolic instructions for each iteration of the loop.


Example 5-2. Symbolic Instructions of the MAC Loop


LOOP:
        LDH   .D    *xptr++,xi      ; load x[i]
        LDH   .D    *yptr++,yi      ; load y[i]
        SMPY  .M    xi,yi,tmp       ; smpy(x[i],y[i])
        SADD  .L    sum,tmp,sum     ; sum=sadd(sum,smpy(x[i],y[i]))
[cntr]  SUB   .ALU  cntr,1,cntr     ; decrement the loop counter
[cntr]  B     .S    LOOP            ; branch to the loop


In Example 5-2, xptr and yptr are the pointers for x and y, respectively. These
six instructions easily fit into one execution packet, but two functional
units are not used.


In general, unrolling the loop once, as in the code in Example 5-3, does not
give the same result as the code shown in Example 5-1, because the result of
saturated addition depends on the order in which the terms are added.


Example 5-3. MAC Loop C Code With Loop Unrolling


Word32 sum_e, sum_o;

sum_e=0;
sum_o=0;

for (i=0; i<N; i+=2) {
    sum_e=L_mac(sum_e,x[i],y[i]);
    sum_o=L_mac(sum_o,x[i+1],y[i+1]);
}

sum=L_add(sum_o,sum_e);


where


L_add(a,b)=_sadd(a,b)


Both approaches lead to the same result if x[i]=y[i] for every i, since
_smpy(x[i],x[i]) is always greater than or equal to 0. This special MAC loop
is used to compute the energy of a particular signal segment. In this case,
take the approach shown in Example 5-3, since it doubles the performance of
Example 5-1. Example 5-4 shows the C code for this special MAC loop.
Example 5-5 lists the symbolic instructions for this loop, with the loop
unrolled once.
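

The ordering dependence of saturated addition is easy to reproduce on a host
machine. The sketch below is a portable-C model, not the EFR code: sat_add and
sat_mpy are hypothetical stand-ins for _sadd and _smpy, and mac_serial and
mac_unrolled mirror Example 5-1 and Example 5-3.

```c
#include <stdint.h>

/* Saturating 32-bit add (model of _sadd). */
int32_t sat_add(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a + (int64_t)b;
    if (r > 2147483647LL) return 2147483647;
    if (r < -2147483647LL - 1) return (int32_t)(-2147483647LL - 1);
    return (int32_t)r;
}

/* Saturating fractional multiply (model of _smpy). */
int32_t sat_mpy(int16_t a, int16_t b)
{
    int64_t r = 2 * (int64_t)a * (int64_t)b;
    return r > 2147483647LL ? 2147483647 : (int32_t)r;
}

/* Example 5-1: a single accumulator, terms added in index order. */
int32_t mac_serial(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum = sat_add(sum, sat_mpy(x[i], y[i]));
    return sum;
}

/* Example 5-3: even and odd accumulators combined at the end (n even). */
int32_t mac_unrolled(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum_e = 0, sum_o = 0;
    int i;
    for (i = 0; i < n; i += 2) {
        sum_e = sat_add(sum_e, sat_mpy(x[i], y[i]));
        sum_o = sat_add(sum_o, sat_mpy(x[i + 1], y[i + 1]));
    }
    return sat_add(sum_o, sum_e);
}
```

With x = {32767, 32767, 32767, 32767} and y = {32767, 32767, -32767, -32767},
the serial loop saturates on its second term and the two functions return
different values. When y is replaced by x, every product is non-negative and
both functions agree.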
Example 5-4. Special MAC Loop C Code


sum=0;
for (i=0; i<N; i++)
    sum = L_mac(sum,x[i],x[i]);


or

sum_e=0;
sum_o=0;
for (i=0; i<N; i+=2) {
    sum_e=L_mac(sum_e,x[i],x[i]);
    sum_o=L_mac(sum_o,x[i+1],x[i+1]);
}

sum=L_add(sum_o,sum_e);


Example 5-5. Special MAC Loop Symbolic Instructions


LOOP:
        LDH   .D  *xptre++,xi          ; load x[i]
        SMPY  .M  xi,xi,tmp_e          ; smpy(x[i],x[i])
        SADD  .L  sum_e,tmp_e,sum_e    ; sum_e=sadd(sum_e,smpy(x[i],x[i]))
        LDH   .D  *xptro++,xi+1        ; load x[i+1]
        SMPY  .M  xi+1,xi+1,tmp_o      ; smpy(x[i+1],x[i+1])
        SADD  .L  sum_o,tmp_o,sum_o    ; sum_o=sadd(sum_o,smpy(x[i+1],x[i+1]))
[cntr]  SUB   .S  cntr,2,cntr          ; decrement the loop counter
[cntr]  B     .S  LOOP                 ; branch to the loop
        SADD  .L  sum_e,sum_o,sum      ; sum=sadd(sum_o,sum_e)


In Example 5-5, xptre and xptro are the pointers for x and point to x[0] and
x[1], respectively, at the beginning. The eight instructions in the loop fit
perfectly into one execution packet. Notice that this approach computes two
MACs in one cycle. It doubles the performance of Example 5-2 for the general
MAC loop.


The final assembly code is shown in Example 5-6.
Example 5-6. Final Assembly Code for the Energy Computation MAC Loop


**************************************************************************
**  Texas Instruments, Inc
**
**  MAC Loop -- Energy Computation
**
**  Compute two samples at a time
**
**  Total cycles = (N/2+2)
**
**  Register Usage:    A    B
**                     3    3
**
**  Notice that x[0] and x[1] will not be available till LOOP
**  is executed once. Therefore, sum_e and sum_o should be 0s
**  for the first three iterations. This is why A5, B5, A6,
**  and B6 should be set to 0s in the prolog.
**************************************************************************

                                     ; A4 -- &x[0]
                                     ; B4 -- N
                                     ; A6 -- sum

        ADD   .L2X  A4,2,B4          ; &x[1]
||      SUB   .D2   B4,6,B1          ; loop counter

        B     .S2   LOOP             ; branch to the loop
||      MVK   .S1   0,A6             ; initialize sum_e
||      LDH   .D1   *A4++[2],A5      ; load x[0]
||      LDH   .D2   *B4++[2],B5      ; load x[1]

        B     .S2   LOOP             ; branch to the loop
||      MV    .L2X  A6,B6            ; initialize sum_o
||      LDH   .D1   *A4++[2],A5      ; load x[2]
||      LDH   .D2   *B4++[2],B5      ; load x[3]

        B     .S1   LOOP             ; branch to the loop
||      MV    .L1   A6,A5            ; take care of the initial three iterations
||      MV    .L2   B6,B5            ; take care of the initial three iterations
||      LDH   .D1   *A4++[2],A5      ; load x[4]
||      LDH   .D2   *B4++[2],B5      ; load x[5]

        B     .S1   LOOP
||      LDH   .D1   *A4++[2],A5      ; load x[6]
||      LDH   .D2   *B4++[2],B5      ; load x[7]

LOOP:
        SMPY  .M1   A5,A5,A7         ; smpy(x[i],x[i])
||      SMPY  .M2   B5,B5,B7         ; smpy(x[i+1],x[i+1])
||      SADD  .L1   A7,A6,A6         ; sum_e=sadd(sum_e,smpy(x[i],x[i]))
||      SADD  .L2   B7,B6,B6         ; sum_o=sadd(sum_o,smpy(x[i+1],x[i+1]))
||      LDH   .D1   *A4++[2],A5      ; load x[i]
||      LDH   .D2   *B4++[2],B5      ; load x[i+1]
|| [B1] B     .S1   LOOP             ; branch to the loop
|| [B1] SUB   .S2   B1,2,B1          ; decrement loop counter

        SADD  .L1X  A6,B6,A6         ; final result, sum = sum_e + sum_o


5.2.2 Implementation of the Windowing and Scaling Part of autocorr.c 


autocorr.c is one of the ten most computationally intensive modules of the
EFR. The part shown in Example 5-7 windows the speech samples and scales
down the windowed sample sequence if the input level is too high.
Example 5-7. C Code for Windowing and Scaling Part of autocorr.c


#define L_WINDOW 240


input:
    Word16 x[L_WINDOW], wind[L_WINDOW];
local variables/arrays:
    Word16 i;
    Word16 y[L_WINDOW];
    Word32 sum;
    Word16 overfl, overfl_shft;


Original C code:


    /* Windowing of signal */
    for (i = 0; i < L_WINDOW; i++)
    {
        y[i] = mult_r (x[i], wind[i]);
    }

    /* Compute r[0] and test for overflow */

    overfl_shft = 0;

    do
    {
        overfl = 0;
        sum = 0L;
        for (i = 0; i < L_WINDOW; i++)
        {
            sum = L_mac (sum, y[i], y[i]);
        }
        /* If overflow divide y[] by 4 */
        if (L_sub (sum, MAX_32) == 0L)
        {
            overfl_shft = add (overfl_shft, 4);
            overfl = 1;            /* Set the overflow flag */
            for (i = 0; i < L_WINDOW; i++)
            {
                y[i] = shr (y[i], 2);
            }
        }
    }
    while (overfl != 0);

Where   mult_r(a,b)  = _sadd(_smpy(a,b),0x8000L)>>16
        L_mac(a,b,c) = _sadd(a,_smpy(b,c))
        L_sub(a,b)   = _ssub(a,b)
        add(a,b)     = ((_sadd((a)<<16,((b)<<16)))>>16)
        shr(a,b)     = ((b)<0 ? (_sshl((a),(-b+16))>>16) : ((a)>>(b)))
Figure 5-1. Flow Diagram for Example 5-7


Loop 1:   for(i = 0; i < L_WINDOW; i++)
              y[i] = mult_r(x[i], wind[i])

Loop 2:   for(i = 0; i < L_WINDOW; i++)
              sum = L_mac(sum, y[i], y[i])

Test:     L_sub(sum, MAX_32) == 0L ?

Loop 3:   for(i = 0; i < L_WINDOW; i++)
              y[i] = shr(y[i], 2)


5.2.2.1 Unrolling the Loop 


Try the loop-unrolling technique on each loop.


Example 5-8 is the list of symbolic instructions needed to execute one iteration
of Loop 1. You can use any arithmetic logic unit (ALU) for the loop-counter
update.


Example 5-8. Instructions to Execute One Iteration of Loop 1


LOOP:
        LDH   .D    *windptr++,windi         ;load wind[i]
        LDH   .D    *xptr++,xi               ;load x[i]
        SMPY  .M    windi,xi,windxi0         ;smpy(x[i],wind[i])
        SADD  .L    windxi0,0x8000,windxi1   ;sadd(smpy(x[i],wind[i]),0x8000L)
        SHR   .S    windxi1,16,yi            ;sadd(smpy(x[i],wind[i]),0x8000L)>>16
        STH   .D    yi,*yptr++               ;store y[i]
[cntr]  SUB   .ALU  cntr,1,cntr              ;decrement loop counter
[cntr]  B     .S    LOOP                     ;branch to loop
In Example 5-8, windptr, xptr, and yptr are the pointers for wind, x, and y.


The unit used most often (three times) is the .D unit (or the .S unit). With
properly partitioned resources, this is a two-cycle loop.


If you unroll the loop once and load both x and wind in words (in GSM EFR,
both x and wind can be loaded in words if they are aligned on a word
boundary), you can compute two values of y in two cycles. The following is
the new list of the instructions in a loop iteration.


Example 5-9. New Instructions for Loop 1


LOOP:
        LDW    .D  *windptr++,windi_windi+1         ;load wind[i] and wind[i+1]
        LDW    .D  *xptr++,xi_xi+1                  ;load x[i] and x[i+1]
        SMPY   .M  windi_windi+1,xi_xi+1,windxi0    ;smpy(x[i],wind[i])
        SMPYH  .M  windi_windi+1,xi_xi+1,windxi0+1  ;smpy(x[i+1],wind[i+1])
        SADD   .L  windxi0,0x8000,windxi1           ;sadd(smpy(x[i],wind[i]),0x8000L)
        SADD   .L  windxi0+1,0x8000,windxi1+1       ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)
        SHR    .S  windxi1,16,yi                    ;sadd(smpy(x[i],wind[i]),0x8000L)>>16
        SHR    .S  windxi1+1,16,yi+1                ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16
        STH    .D  yi,*yptre++[2]                   ;store y[i]
        STH    .D  yi+1,*yptro++[2]                 ;store y[i+1]
[cntr]  SUB    .S  cntr,2,cntr                      ;decrement loop counter
[cntr]  B      .S  LOOP                             ;branch to loop


In Example 5-9, yptre and yptro are the pointers for y and point to y[0] and
y[1], respectively, at the beginning.


Note:

Loop 2 is a special MAC loop, as described in subsection 5.2.1.
It can be implemented either as shown in Example 5-10 without loop unrolling
or as in Example 5-11 with the loop unrolled once.


Example 5-10. List of Instructions for Loop 2, No Loop Unrolling


        LDH   .D  *yptr++,yi      ;load y[i]
        SMPY  .M  yi,yi,yyi       ;smpy(y[i],y[i])
        SADD  .L  sum,yyi,sum     ;sadd(sum,smpy(y[i],y[i]))
[cntr]  SUB   .S  cntr,1,cntr     ;decrement loop counter
[cntr]  B     .S  LOOP            ;branch to loop
Example 5-11. Instructions for Loop 2 With Loop Unrolling


LOOP:
        LDH   .D  *yptre++,yi           ;load y[i]
        LDH   .D  *yptro++,yi+1         ;load y[i+1]
        SMPY  .M  yi,yi,yyi             ;smpy(y[i],y[i])
        SMPY  .M  yi+1,yi+1,yyi+1       ;smpy(y[i+1],y[i+1])
        SADD  .L  sum_e,yyi,sum_e       ;sadd(sum_e,smpy(y[i],y[i]))
        SADD  .L  sum_o,yyi+1,sum_o     ;sadd(sum_o,smpy(y[i+1],y[i+1]))
[cntr]  SUB   .S  cntr,2,cntr           ;decrement loop counter
[cntr]  B     .S  LOOP                  ;branch to loop

        SADD  .L  sum_e,sum_o,sum       ;sum=sum_o+sum_e


Later, you will see both approaches used in this application. 


Loop 3 is a single-cycle loop and you cannot speed it up by simply unrolling 
the loop. The instructions for each iteration are shown in Example 5-12. 


Example 5-12. Instructions for Loop 3


LOOP:
        LDH   .D  *yptrl++,yi      ;load y[i]
        SHR   .S  yi,2,yi0         ;shr(y[i],2)
        STH   .D  yi0,*yptrs++     ;store y[i]=shr(y[i],2)
[cntr]  SUB   .L  cntr,1,cntr      ;decrement loop counter
[cntr]  B     .S  LOOP             ;branch to loop


In Example 5-12, yptrl and yptrs are the pointers to array y for loading and
storing purposes, respectively.


The new flow diagram is shown in Figure 5-2. 
Figure 5-2. Unrolling the Loop


Loop 1:   for(i = 0; i < L_WINDOW; i+=2) {
              y[i] = mult_r(x[i], wind[i])
              y[i+1] = mult_r(x[i+1], wind[i+1])}

Loop 2:   for(i = 0; i < L_WINDOW; i+=2) {
              sum_o = L_mac(sum_o, y[i], y[i])
              sum_e = L_mac(sum_e, y[i+1], y[i+1])
          }
          sum = sum_o+sum_e

Test:     L_sub(sum, MAX_32) == 0L ?

Loop 3:   for(i = 0; i < L_WINDOW; i++)
              y[i] = shr(y[i], 2)


5.2.2.2 Rearrange the C Code 


The first execution of Loop 2 can be combined with Loop 1 to form Loop I, and
its subsequent executions can be combined with Loop 3 to form Loop II.


One small change is the implementation of if L_sub(sum, MAX_32) == 0L as
sum == MAX_32.


The new flow diagram with rearranged C code is shown in Figure 5-3. 




Figure 5-3. Flow Diagram With Rearranged C Code


Loop I:   for(i = 0; i < L_WINDOW; i+=2) {
              y[i] = mult_r(x[i], wind[i])
              y[i+1] = mult_r(x[i+1], wind[i+1])
              sum_o = L_mac(sum_o, y[i], y[i])
              sum_e = L_mac(sum_e, y[i+1], y[i+1])
          }
          sum = sum_o + sum_e

Loop II:  for(i = 0; i < L_WINDOW; i++) {
              y[i] = shr(y[i], 2)
              sum = L_mac(sum, y[i], y[i]) }


You can implement Loop I using either of the two approaches shown in
Example 5-13.
Example 5-13. Instructions for Loop I


LOOP:
        LDW    .D  *windptr++,windi_windi+1         ;load wind[i] and wind[i+1]
        LDW    .D  *xptr++,xi_xi+1                  ;load x[i] and x[i+1]
        SMPY   .M  windi_windi+1,xi_xi+1,windxi0    ;smpy(x[i],wind[i])
        SMPYH  .M  windi_windi+1,xi_xi+1,windxi0+1  ;smpy(x[i+1],wind[i+1])
        SADD   .L  windxi0,0x8000,windxi1           ;sadd(smpy(x[i],wind[i]),0x8000L)
        SADD   .L  windxi0+1,0x8000,windxi1+1       ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)
        SHR    .S  windxi1,16,yi                    ;sadd(smpy(x[i],wind[i]),0x8000L)>>16
        SHR    .S  windxi1+1,16,yi+1                ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16
        SMPY   .M  yi,yi,yyi                        ;smpy(y[i],y[i])
        SMPY   .M  yi+1,yi+1,yyi+1                  ;smpy(y[i+1],y[i+1])
        SADD   .L  sum_e,yyi,sum_e                  ;sum_e=sadd(sum_e,smpy(y[i],y[i]))
        SADD   .L  sum_o,yyi+1,sum_o                ;sum_o=sadd(sum_o,smpy(y[i+1],y[i+1]))
        STH    .D  yi,*yptre++[2]                   ;store y[i]
        STH    .D  yi+1,*yptro++[2]                 ;store y[i+1]
[cntr]  SUB    .S  cntr,2,cntr                      ;decrement loop counter
[cntr]  B      .S  LOOP                             ;branch to loop

or as

LOOP:
        LDW    .D  *windptr++,windi_windi+1         ;load wind[i] and wind[i+1]
        LDW    .D  *xptr++,xi_xi+1                  ;load x[i] and x[i+1]
        SMPY   .M  windi_windi+1,xi_xi+1,windxi0    ;smpy(x[i],wind[i])
        SMPYH  .M  windi_windi+1,xi_xi+1,windxi0+1  ;smpy(x[i+1],wind[i+1])
        SADD   .L  windxi0,0x8000,windxi1           ;sadd(smpy(x[i],wind[i]),0x8000L)
        SADD   .L  windxi0+1,0x8000,windxi1+1       ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)
        SHR    .S  windxi1,16,yi                    ;sadd(smpy(x[i],wind[i]),0x8000L)>>16
        SHR    .S  windxi1+1,16,yi+1                ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16
        SMPYH  .M  windxi1,windxi1,yyi              ;smpy(y[i],y[i])
        SMPYH  .M  windxi1+1,windxi1+1,yyi+1        ;smpy(y[i+1],y[i+1])
        SADD   .L  sum_e,yyi,sum_e                  ;sum_e=sadd(sum_e,smpy(y[i],y[i]))
        SADD   .L  sum_o,yyi+1,sum_o                ;sum_o=sadd(sum_o,smpy(y[i+1],y[i+1]))
        STH    .D  yi,*yptre++[2]                   ;store y[i]
        STH    .D  yi+1,*yptro++[2]                 ;store y[i+1]
[cntr]  SUB    .S  cntr,2,cntr                      ;decrement loop counter
[cntr]  B      .S  LOOP                             ;branch to loop


The only difference between these two implementations is how sum_e and
sum_o are computed. Using sum_o as an example, the former approach
computes sum_o following the order of the original C code:


sum_o = _sadd(sum_o, _smpy(_sadd(_smpy(a,b),0x8000L)>>16,
                           _sadd(_smpy(a,b),0x8000L)>>16)),


The latter computes sum_o in a slightly different way, multiplying the upper
halves of the rounded 32-bit values directly instead of shifting them first:


sum_o = _sadd(sum_o, _smpyh(_sadd(_smpy(a,b),0x8000),
                            _sadd(_smpy(a,b),0x8000))).


This removes the dependence of the multiply on the shift, which provides the
flexibility to pack the instructions better and reduces the cycle count.
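

The equivalence of the two approaches can be checked with a small host-side
model. This is only a sketch: the helper names are hypothetical, the
saturating operations are portable-C models of _sadd, _smpy, and _smpyh, and
the code assumes that >> on a negative int32_t is an arithmetic shift, as it
is on common compilers.

```c
#include <stdint.h>

/* Saturating 32-bit add (model of _sadd). */
int32_t sadd32(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a + (int64_t)b;
    if (r > 2147483647LL) return 2147483647;
    if (r < -2147483647LL - 1) return (int32_t)(-2147483647LL - 1);
    return (int32_t)r;
}

/* Saturating fractional multiply of two halfwords (model of _smpy). */
int32_t smpy16(int16_t a, int16_t b)
{
    int64_t r = 2 * (int64_t)a * (int64_t)b;
    return r > 2147483647LL ? 2147483647 : (int32_t)r;
}

/* Model of _smpyh: the same multiply applied to the upper halfwords.
 * Assumes arithmetic right shift of negative values. */
int32_t smpyh32(int32_t a, int32_t b)
{
    return smpy16((int16_t)(a >> 16), (int16_t)(b >> 16));
}

/* Former approach: form y = mult_r(a,b) first, then square it with SMPY. */
int32_t energy_term_shifted(int16_t a, int16_t b)
{
    int32_t w = sadd32(smpy16(a, b), 0x8000);
    int16_t y = (int16_t)(w >> 16);
    return smpy16(y, y);
}

/* Latter approach: square the upper half of the rounded value with SMPYH. */
int32_t energy_term_smpyh(int16_t a, int16_t b)
{
    int32_t w = sadd32(smpy16(a, b), 0x8000);
    return smpyh32(w, w);
}
```

Both functions return the same value for every input pair, because the
upper halfword that SMPYH reads is exactly the value that the SHR by 16
would have produced.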


Loop I is a two-cycle loop. Loop II is still a single-cycle loop. Its
instructions are shown in Example 5-14.


Example 5-14. Instructions for Loop II


LOOP:
        LDH   .D  *yptrl++,yi      ;load y[i]
        SHR   .S  yi,2,yi0         ;shr(y[i],2)
        SMPY  .M  yi0,yi0,yyi      ;smpy(shr(y[i],2),shr(y[i],2))
        SADD  .L  sum,yyi,sum      ;sum=sadd(sum,smpy(shr(y[i],2),shr(y[i],2)))
        STH   .D  yi0,*yptrs++     ;store y[i]=shr(y[i],2)
[cntr]  SUB   .L  cntr,1,cntr      ;decrement loop counter
[cntr]  B     .S  LOOP             ;branch to loop


5.2.2.3 Memory-Bank Hits 


To schedule Loop I as a two-cycle loop:


- The words x[i] + x[i+1] << 16 and wind[i] + wind[i+1] << 16 must be loaded
  in the same cycle.

- y[i] and y[i+1] must be stored in the other cycle.


To avoid a memory-bank hit:


- Allocate x and wind in different memory spaces, if possible. For instance,
  allocate wind in data ROM and x in data RAM.

- If no data ROM is available, allocate x and wind so they are offset from
  each other by one word.


There is no memory-bank hit problem when storing y[i] and y[i+1]. No
memory-bank hits occur in Loop II, since the distance between the load and
store is always six halfwords.
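

The allocation rules above can be checked with a simple bank model. The
sketch below is an illustration under stated assumptions, not a description
of any particular device: it assumes a single memory block made of four
interleaved banks, each 16 bits wide, and accesses aligned to their own size.
NUM_BANKS, BANK_BYTES, and the function names are hypothetical.

```c
#include <stdint.h>

#define NUM_BANKS  4u   /* assumption: four interleaved banks */
#define BANK_BYTES 2u   /* assumption: each bank is 16 bits wide */

/* Bit mask of the banks touched by an access of `size` bytes at `addr`. */
unsigned banks_touched(uint32_t addr, unsigned size)
{
    unsigned mask = 0;
    uint32_t a;
    for (a = addr; a < addr + size; a += BANK_BYTES)
        mask |= 1u << ((a / BANK_BYTES) % NUM_BANKS);
    return mask;
}

/* Nonzero when two accesses issued in the same cycle share a bank. */
int bank_hit(uint32_t addr1, unsigned size1, uint32_t addr2, unsigned size2)
{
    return (banks_touched(addr1, size1) & banks_touched(addr2, size2)) != 0;
}
```

In this model two word loads whose addresses differ by one word use
disjoint banks, two word loads eight bytes apart collide, and two halfword
accesses six halfwords apart never collide, which matches the rules above.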


This part of the autocorr.c C code is implemented in Example 5-15.


Example 5-15. Implemented C Code for autocorr.c


    Word16 i;
    Word16 y[L_WINDOW];
    Word32 sum, sum_e, sum_o;
    Word16 overfl, overfl_shft;

    /* Windowing of signal */

    sum_e=sum_o=0L;

    for (i = 0; i < L_WINDOW; i+=2)
    {
        y[i] = mult_r (x[i], wind[i]);
        y[i+1] = mult_r (x[i+1], wind[i+1]);
        sum_e = L_mac(sum_e, y[i], y[i]);
        sum_o = L_mac(sum_o, y[i+1], y[i+1]);
    }

    sum=sum_e+sum_o;

    /* Compute r[0] and test for overflow */

    overfl_shft = 0;
    do
    {
        overfl = 0;
        /* If overflow divide y[] by 4 */
        if (sum == MAX_32)
        {
            overfl_shft = add (overfl_shft, 4);
            overfl = 1;            /* Set the overflow flag */
            sum=0L;
            for (i = 0; i < L_WINDOW; i++)
            {
                y[i] = shr (y[i], 2);
                sum = L_mac(sum, y[i], y[i]);
            }
        }
    }
    while (overfl != 0);


5.2.2.4 Code-Size Reduction 


Finally, consider code-size reduction for the flow shown in Figure 5-3.
Notice that Loop I is always executed and that Loop II is not executed, except
for high input levels. This means that cycle count is the most important factor
for Loop I, while code size is more critical for Loop II.

The final assembly code is presented in Example 5-16.


Example 5-16. Final Assembly Code for Windowing and Scaling of autocorr.c


**************************************************************************
**  Texas Instruments, Inc
**
**  Implementation of The Windowing and Scaling Part of autocorr.c
**  In EFR
**  Compute two samples at a time
**
**  Total cycles = 257 (No Scaling)
**               = 519 (One Scaling)
**
**  Register Usage:    A
**                     11
**************************************************************************

; B4  -- &x[0]
; A4  -- &wind[0]
; A6  -- &y[0]
; B8  -- L_WINDOW
; A0  -- sum and sum_e
; B0  -- sum_o
; B15 -- stack pointer

; notice that we use the latter approach in Example 5-13 to obtain the sum terms

        LDW    .D2   *B4++,B5       ; load x[0] & x[1]
        LDW    .D1   *A4++,A5       ; load wind[0] & wind[1]
        MVK    .S1   480,A6         ; reserve space for y[i]
        SUB    .S2   B8,6,B1        ; LOOP I counter
        SUB    .L1X  B15,A6,A6      ; &y[0]
        MVK    .S2   1,B7
        LDW    .D2   *B4++,B5       ; load x[2] & x[3]
        LDW    .D1   *A4++,A5       ; load wind[2] & wind[3]
        SHL    .S2   B7,15,B7       ; 32768 or 0x8000L for rounding
        MVK    .S1   -1,A10
        ADD    .L2X  A6,2,B6        ; &y[1]
        MV     .L1   A6,A3          ; &y[0]
        LDW    .D2   *B4++,B5       ; load x[4] & x[5]
        LDW    .D1   *A4++,A5       ; load wind[4] & wind[5]
        MVKLH  .S1   32767,A10      ; 7fffffff = MAX_32
        MV     .L1X  B7,A7          ; 32768
        SMPYH  .M2X  B5,A5,B2       ; smpy(x[1],wind[1])
        SMPY   .M1X  B5,A5,A2       ; smpy(x[0],wind[0])
        B      .S2   LOOP_I
        LDW    .D2   *B4++,B5       ; load x[6] & x[7]
        LDW    .D1   *A4++,A5       ; load wind[6] & wind[7]
        MVK    .S1   0,A0           ; sum_e = 0
        MVK    .S2   0,B0           ; sum_o = 0
        SMPYH  .M2X  B5,A5,B2       ; smpy(x[3],wind[3])
        SMPY   .M1X  B5,A5,A2       ; smpy(x[2],wind[2])
        SADD   .L1   A2,A7,A2       ; sadd(smpy(x[0],wind[0]),0x8000L)
        SADD   .L2   B2,B7,B2       ; sadd(smpy(x[1],wind[1]),0x8000L)
        B      .S1   LOOP_I
        LDW    .D2   *B4++,B5       ; load x[8] & x[9]
        LDW    .D1   *A4++,A5       ; load wind[8] & wind[9]
        SHR    .S1   A2,16,A9       ; y[0]=sadd(smpy(x[0],wind[0]),0x8000L)>>16
        SHR    .S2   B2,16,B9       ; y[1]=sadd(smpy(x[1],wind[1]),0x8000L)>>16
        SMPYH  .M1   A2,A2,A11      ; smpy(y[0],y[0])
        SMPYH  .M2   B2,B2,B11      ; smpy(y[1],y[1])
LOOP_I:
        STH    .D1   A9,*A6++[2]    ; store y[0]
        STH    .D2   B9,*B6++[2]    ; store y[1]
        SADD   .L1   A2,A7,A2       ; sadd(smpy(x[2],wind[2]),0x8000L)
        SADD   .L2   B2,B7,B2       ; sadd(smpy(x[3],wind[3]),0x8000L)
        SMPYH  .M2X  B5,A5,B2       ; smpy(x[5],wind[5])
        SMPY   .M1X  B5,A5,A2       ; smpy(x[4],wind[4])
[B1]    SUB    .S2   B1,2,B1        ; decrement the loop counter
[B1]    B      .S1   LOOP_I
        SADD   .L1   A0,A11,A0      ; sum_e += smpy(y[0],y[0])
        SADD   .L2   B0,B11,B0      ; sum_o += smpy(y[1],y[1])
        SMPYH  .M1   A2,A2,A11      ; smpy(y[2],y[2])
        SMPYH  .M2   B2,B2,B11      ; smpy(y[3],y[3])
        SHR    .S1   A2,16,A9       ; y[2]=sadd(smpy(x[2],wind[2]),0x8000L)>>16
        SHR    .S2   B2,16,B9       ; y[3]=sadd(smpy(x[3],wind[3]),0x8000L)>>16
        LDW    .D2   *B4++,B5       ; load x[10] & x[11]
        LDW    .D1   *A4++,A5       ; load wind[10] & wind[11]
        SADD   .L1X  A0,B0,A0       ; sum = sum_e + sum_o
        MPY    .M2   B0,0,B0        ; overfl_shift = 0
                                    ; LOOP I completed
Example 5-16. Final Assembly Code for Windowing and Scaling of autocorr.c (Continued)


LTEST: 


DDD 
PRP R 


PERE EEE 


> 


2 


> 


a 


FINISH: 


CMPE 


SA 


NO. 


DD 


DD 


AO,A10,Al1 


FINISH 
*A3,B5 
A3,2,B9 
BO, 4,B0 


*B9++,B5 
B8,7,Bl 
LOOP II 
A3,A9 
*B9++,B5 
0,A0 
LOOP II 


*B9++,B5 
LOOP II 
AOQ,A2 


*BO++,B5 
LOOP II 


*BO++, BS 
B5,2,A5 
LOOP II 


*B9++,B5 
B5,2,A5 

LOOP II 

A5, *A9++ 
B1L,-1,B1 
A5,A5,A2 
A2,A0,A0 


A5, *A9++ 
A5,A5,A2 
A2,A0,A0 
LTEST 


A2,A0,A0 


A2,A0,A0 


if (sum == MAX _ 32) 


No, exit 

load y[0] 

é&y[1] 

add (overfl_shift, 4) 


load y[1] 
counter for LOOPII 


load y[3] 
to take care of the initial condition 


load y[4] 


load y[5] 
y[0]J=shr(y[0],2) 


load y[6] 

y[1] = shr(y[1],2) 
branch 

store y[0] 

derement LOOPII counter 
smpy (y[0],y[0]) 

sum +=smpy (y[i],y[il) 


store y[n-1] 

smpy (y[n-1],y[n-1]) 

sum +=smpy (y[n-3],y[n-3]) 
branch back to LTEST 


sum t+t=smpy (y[n-2],y[n-2]) 


sum t+t=smpy (y[n-1],y[n-1]) 


save the code siz 
If code size is not an issue, you can eliminate the last three NOPs by
expanding the epilog of Loop II. This saves three cycles every time Loop II
executes, at the cost of two additional fetch packets (2 x 32 = 64 bytes) of
code size.


5.2.3. Implementation of cor_h 


cor_h, a routine in the code search module, is the second-most
computationally intensive routine. It is called to compute the matrix of
autocorrelation, rr. The core part of cor_h is presented in Example 5-17.


Example 5-17. C Code cor_h


#define L_CODE 40


input:
    Word16 sign[L_CODE], h[L_CODE];

output:
    Word16 rr[L_CODE][L_CODE];

local variables/arrays:
    Word16 h2[L_CODE];   /* function of h, the impulse response of weighted
                            synthesis filter */
    Word16 dec, j, i, k;
    Word32 s;


Original C code

    for (dec=1; dec<L_CODE; dec++)
    {
        s = 0;
        j = L_CODE-1;
        i = sub(j, dec);

        for (k=0; k<(L_CODE-dec); k++, i--, j--)
        {
            s = L_mac(s, h2[k], h2[k+dec]);
            rr[j][i] = mult(round(s), mult(sign[i],sign[j]));
            rr[i][j] = rr[j][i];
        }
    }

where   sub(a,b)     = _ssub(a<<16, b<<16)>>16
        L_mac(a,b,c) = _sadd(a,_smpy(b,c))
        mult(a,b)    = _smpy(a,b)>>16
and     round(a)     = _sadd(a,0x8000L)>>16


The instructions to execute one iteration of the inner loop are listed in
Example 5-18.
Example 5-18. Instructions to Execute One Inner Loop Iteration


INNERLOOP:
        LDH    .D    *h2ptr++,h2k          ;load h2[k]
        LDH    .D    *h2decptr++,h2deck    ;load h2[k+dec]
        SMPY   .M    h2k,h2deck,h2kk       ;smpy(h2[k],h2[k+dec])
        SADD   .L    s,h2kk,s              ;sadd(s,smpy(h2[k],h2[k+dec]))
        SADD   .L    s,0x8000L,sround      ;round(s)<<16
        LDH    .D    *signiptr--,signi     ;load sign[i]
        LDH    .D    *signjptr--,signj     ;load sign[j]
        SMPY   .M    signi,signj,signij    ;smpy(sign[i],sign[j])=mult(sign[i],sign[j])<<16
        SMPYH  .M    signij,sround,rrji0   ;L_mult(round(s),mult(sign[i],sign[j]))
        SHR    .S    rrji0,16,rrji         ;rr[j][i]
        STH    .D    rrji,*rrjiptr--[41]   ;store rr[j][i]
        STH    .D    rrji,*rrijptr--[41]   ;store rr[i][j]
[cntr]  SUB    .ALU  cntr,1,cntr           ;decrement inner loop counter
[cntr]  B      .S    INNERLOOP             ;branch to inner loop


In Example 5-18, h2ptr and h2decptr are the pointers for h2, pointing to h2[k]
and h2[k+dec]. signiptr and signjptr are the pointers for sign, pointing to
sign[i] and sign[j]. rrjiptr and rrijptr are the pointers for rr, pointing to
rr[j][i] and rr[i][j], respectively.


There is no need to compute round(s) and mult(sign[i], sign[j]) explicitly,
since the SMPYH instruction multiplies the upper halves of its operands.


The .D unit (the unit used most often) is used six times in the inner loop. Ideally,
these instructions can be arranged in three cycles. However, memory-bank
hits occur with any combination of the load and/or store instructions.


Next, consider unrolling the inner loop once. The C code is shown in
Example 5-19.


Example 5-19. C Code With Inner Loop Unrolling


for (dec=1; dec<L_CODE; dec++)
{
    s = 0;
    j = L_CODE-1;
    i = sub(j, dec);
    /* process the terms in pairs; an odd leftover term is handled below */
    for (k=0; k<(L_CODE-dec-1); k+=2, i-=2, j-=2)
    {
        s = L_mac(s, h2[k], h2[k+dec]);
        rr[j][i] = mult(round(s), mult(sign[i],sign[j]));
        rr[i][j] = rr[j][i];
        s = L_mac(s, h2[k+1], h2[k+1+dec]);
        rr[j-1][i-1] = mult(round(s), mult(sign[i-1],sign[j-1]));
        rr[i-1][j-1] = rr[j-1][i-1];
    }
    if ((dec&1)!=0) {
        s = L_mac(s,h2[L_CODE-dec-1],h2[L_CODE-1]);
        rr[dec][0] = mult(round(s),mult(sign[0],sign[dec]));
        rr[0][dec] = rr[dec][0];
    }
}
Notice that eight values must be loaded and four values must be stored in
every iteration; however, h2[k] and h2[k+1] can be loaded in a word. The
same is true for sign[j] and sign[j-1]. A total of six loadings are required. The
inner loop instructions, then, are shown in Example 5-20.
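

As a host-side illustration of why one word load can replace two halfword loads, the following sketch packs two adjacent 16-bit samples into a 32-bit word and then multiplies by either half. The function names are hypothetical, the low-half-first layout is an assumption (it matches a little-endian word load of h2[k], h2[k+1]), and the products are plain integer products without the doubling and saturation of the saturating multiply instructions.

```c
#include <stdint.h>

/* Pack two 16-bit samples into one word: lo in bits 0-15, hi in bits 16-31. */
static uint32_t pack_halves(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Recover each signed half from the packed word. */
static int16_t low_half(uint32_t w)  { return (int16_t)(uint16_t)(w & 0xFFFFu); }
static int16_t high_half(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }

/* Low half of the packed word times x (the low-by-low selection). */
static int32_t mul_low_low(uint32_t w, int16_t x)
{
    return (int32_t)low_half(w) * x;
}

/* High half of the packed word times x (the high-by-low selection). */
static int32_t mul_high_low(uint32_t w, int16_t x)
{
    return (int32_t)high_half(w) * x;
}
```

With the pair held in one register, the two products of an unrolled iteration need one load for the pair plus one load each for the other operands, which is how eight values shrink to six loads.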
Example 5-20. Inner Loop Instructions With Loop Unrolling


INNERLOOP:
        LDW     .D    *h2ptr++,h2k_h2k+1             ;load h2[k] and h2[k+1]
        LDH     .D    *h2decptr++,h2deck             ;load h2[k+dec]
        SMPY    .M    h2k_h2k+1,h2deck,h2kk0         ;smpy(h2[k],h2[k+dec])
        SADD    .L    s,h2kk0,s                      ;sadd(s,smpy(h2[k],h2[k+dec]))
        SADD    .L    s,0x8000L,sround               ;round(s)<<16
        LDH     .D    *signiptr--,signi              ;load sign[i]
        LDW     .D    *signjptr--,signj_signj-1      ;load sign[j]
        SMPYLH  .M    signi,signj_signj-1,signij0    ;smpy(sign[i],sign[j])
        SMPYH   .M    signij0,sround,rrji0           ;L_mult(round(s),mult(sign[i],sign[j]))
        SHR     .S    rrji0,16,rrji                  ;rr[j][i]
        STH     .D    rrji,*rrjiptr--[82]            ;store rr[j][i]
        STH     .D    rrji,*rrijptr--[82]            ;store rr[i][j]
        LDH     .D    *h2decptr++,h2deck+1           ;load h2[k+1+dec]
        SMPYHL  .M    h2k_h2k+1,h2deck+1,h2kk1       ;smpy(h2[k+1],h2[k+1+dec])
        SADD    .L    s,h2kk1,s                      ;sadd(s,smpy(h2[k+1],h2[k+1+dec]))
        SADD    .L    s,0x8000L,sround               ;round(s)<<16
        LDH     .D    *signiptr--,signi-1            ;load sign[i-1]
        SMPY    .M    signi-1,signj_signj-1,signij1  ;smpy(sign[i-1],sign[j-1])
        SMPYH   .M    signij1,sround,rrji1           ;L_mult(round(s),mult(sign[i-1],sign[j-1]))
        SHR     .S    rrji1,16,rrj1i1                ;rr[j-1][i-1]
        STH     .D    rrj1i1,*rrj1i1ptr--[82]        ;store rr[j-1][i-1]
        STH     .D    rrj1i1,*rri1j1ptr--[82]        ;store rr[i-1][j-1]
[icntr] SUB     .ALU  icntr,2,icntr                  ;decrement inner loop counter
[icntr] B       .S    INNERLOOP                      ;branch to loop


To avoid memory-bank hits:


- Load words (h2[k], h2[k+1]) and (sign[j-1], sign[j]) together and allocate
  h2 and sign such that they are aligned with each other.


- Store rr[j][i] and rr[j-1][i-1] together, and rr[i][j] and rr[i-1][j-1] together.


There is a total of five loading/storing pairs, so that each iteration requires only
five cycles. You gain speed both by eliminating the memory-bank hits and
by reducing the cycles required to complete each rr.
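

The cycle counts quoted for these loops follow from a simple resource bound: every load and store needs a .D unit, and there is one .D unit on each of the two data paths. The helper below is a hypothetical illustration of that arithmetic only. It gives a lower bound, which is reached only when the paired accesses do not hit the same memory bank and no data dependency stretches the loop.

```c
/* Lower bound on cycles per iteration imposed by the two .D units
   (.D1 and .D2): each unit can issue one load or store per cycle. */
static int d_unit_cycle_bound(int loads, int stores)
{
    const int d_units = 2;
    return (loads + stores + d_units - 1) / d_units;   /* ceiling division */
}
```

For the original inner loop (four loads, two stores) the bound is three cycles; for the unrolled loop with paired word loads (six loads, four stores) it is five cycles, matching the five loading/storing pairs described above.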


The final assembly code with reduced code size is shown in Example 5-21.
Here, the primitive technique introduced in Chapter 4 is used to reduce the
code size for both the prolog and epilog of the inner loop.


Example 5-21. Final Assembly Code With Reduced Code Size


Texas Instruments, Inc 


Implementation of cor_h in EFR 


Compute four rrs at a time 


Total cycles 2933 


Register Usage: A 


16 


SUB ma A4,1,A13 ; 
|| ADDK .S1 76,A6 ; 
| | ADDK siZ 3360,B6 ; 
|| SUB .D1 A4,1,A2 ; 

MVK .S2 0,B2 ; 
| | ADD 12 B4,2,B13 ; 
|| MVK .S1 2,All ; 
OUTERLOOP : 

LDW .D1 *A6,A10 i 

LDW .D2 *B4,B12 ; 

ADD .L1X  B13,2,A3 ; 

SUB Psi A6,A11,A4 ; 

[A2] ADD .L2X  A2,2,B0 ; 

MPY .ML A13,A11,A3 

MPY .M2 B11,0,Bl11l ; 

LDH .D2 *B13++[2],A7 ; 

LDH .D1 *A3,B7 ; 

ADD .L2X  A4,2,B9 ; 

MV .S2 B6,B14 ; 

SUB tL A6,4,A8 ; 

[B2] ADDK .S1 -164,A14 ; 


B 


LS: 


A4 --- L_CODE 
B4 -—- &h2[0] 
A6 --- &sign[0] 
B6 --- &rr[0] [0] 


used to obtain é&rr[i] [Jj] 
&sign[L_CODE-2] 
&rxr[L_CODE-1] [L_CODE-2]+[82]=&rr[j] [i]+[82] 
outer loop counter 


and érr[i-1] [4-1] 


not doing the initial store 

&h2 [k+dec] 

used to increase/decrease the pointers 
for h2 and sign 


load sign[j-1] & sign[j] 
load h2[k] & h2[k+1] 

&h2 [k+dect1] 

&sign[i-1] 


define the inner loop counter 
initialize s 


k+dec] 


load h2[ 
[k+dect1] 


load h2 
&sign[i] 
érr[j] [i] +[82] 
&sign[j-3] 


from &rr[dec] [0]+[82] to é&rr[dec] [0] 




B2] 
[B2] 


BO] 
[A2] 
[A2] 
[B2] 


[BO] 


npn w nnn iS > 
ey aan 5 
wmow WW DW = 


ag 


INNERLOOP : 


[!B1] 


[Al] 


[!A1] 
[!A1] 


LDH 
LDH 
SMPYHL 
SMP YLH 
ADDK 
ADDK 


M1 


2o1 
S2 


*B9,A0 
*R4--[2],B5 
B4,4,B8 
All1,2,A11 
-82,B14 

3, Al 


A12,*A14 
-164,A9 
B6,Al14 
BO,1,B0 
B14, A3,B3 
B6,2,B6 


INNERLOOP 
A2,1,A2 
A2,1,B2 
A12, *A9 
A14,A3,A9 
BO, BL 


B9,16,B10 
A3,A0,A3 
B11,A15,B9 


*A8—-,A10 
*B8++,B12 
OUTERLOOP 
B13,2,A3 


*B3,B7 
*B13++[2],A7 
B9,B5,B9 
A7,B12,A7 
B1,1,Bl 
Al,1,Al 
A4,2,B9 


*B9,A0 
*R4—-[2],B5 
A10,A0,A0 
B7,B12,B7 
-164,A14 
-164,B14 


, 


, 


, 


’ 
’ 


’ 


load sign[i] 

load sign[i-1] 

&h2[k+2] 

update All 

é&rr[j] [i] 

determine when the stores in the inner loop 
actually starts 


store rr[dec] [0] 

from &rr[0] [dec]+[82] tp &rr[0] [dec] 
érr[j] [i]+[82] 

inner loop counter 

grr[i-1] [3-1] 

érr[j][i-1], for the next 

outer loop iteration 


decrement outer loop counter 

decide if the last store is needed 
store rr[0] [dec] 

é&rrf[i] [j]+[82] 

counter for branching to outer loop 


## obtain rr[j-1] [i-1] 
# smpyh(sadd(s,0x8000L),smpy(sign[i],sign[j])) 
# sadd(s,0x8000L) 


*load sign[j] & sign[j-1] 
*load h2[k] & h2[k+1] 
outer LOOP 

&h2 [k+dect1] 


*h2 [k+dect1] 

*h2 [k+dec] 

# smpyh(sadd(s,0x8000L),smpy(sign[i-1],sign[j-1]) 

smpy (h2[k] ,h2 [k+dec] 

decrement the counter for branching to the outer loop 
decrement the inner loop 

&sign[i] 


*load sign[ 
*load sign[ 


i] 
aa 
ee ‘lige 
] 
[ 
[ 


1] 

sign[i]) 
smpy (h2[k+1],h2[k+1+dec]) 
## from é&rr[j] [i 
alt 


## from &rr 


ba [82] to &rr[j] [i] 


j 
j-1] {[i-1]+[82] to érr[j-1] [i-1] 


'Al 
'Al] 


'A2] 


INISH: 


STH 
STH 
SADD 


NOP 


«DL 
-D2 
-L1 


Al2,*A14 ; ## store rr[j] [il 
B10, *B14 ; ## store rr[j-1] [i-1] 

xX B11,A7,A5 7 S = sadd(s,smpy (h2[k],h2[k+dec] ) 

X A10,B5,B5 ; smpy (sign[i-1],sign[j-1] 
BO,1,B0 ; decrement inner loop counter 
-164,A9 ; ## from &rr[i][j]+[82] to rr[i][j] 
-164,B3 ; ## from &rr[i-1][j-1]+[82] to &rr[i-1] [j-1] 
Al12,*A9 ; ## store rr[i] [jl 
B10, *B3 ; ## store rr[j-1] [i-1] 
A3,16,A12 ; obtain rr[j-1] [i-1] 
A5,A15,A3 ; sadd(s,0x8000L) 

xX A5,B7,B11 7 S = sadd(s,smpy (h2[k+1],h2[k+dect+1] 
INNERLOOP , end of INNERLOOP 

xX B4,A11,B13 ; &h2[k+dec] 
FINISH , exit 
The value of s is represented by both B11 and A5 to avoid two .L1 or two .L2
units occurring in the same execute packet. Due to the dependence on s as
well as the removal of memory-bank hits, it takes 20 cycles for each iteration
of the modified C code. In the comments, the number of # symbols on an
instruction indicates how many inner loop iterations must pass, each time the
outer loop enters the inner loop, before that instruction is executed or its
result becomes useful.


The code size is 11 fetch packets (352 bytes). Without applying the primitive
technique, the code size will be at least four fetch packets more than the code
shown in Example 5-21.


You can squeeze the instruction

        ADD     .L2X  B4,A11,B13            ; &h2[k+dec]

into the inner loop to save about 1.5% of the cycle counts, with a one fetch
packet increase in program memory.


Implementation of the rrv Computation In search_10i40


Example 5-22 shows the implementation of the rrv computation in search_10i40.


Example 5-22. C Code of the rrv Computation In Search_10i40


#define L_CODE 40
#define STEP   5
#define _1_16  (Word16)(32768L/16)
#define _1_8   (Word16)(32768L/8)

input:
    Word16 rr[L_CODE][L_CODE], ipos[L_CODE];

local variables/arrays:
    Word16 rrv[L_CODE];
    Word16 i0,i1,i2,i3,i4,i5,i6,i7,i8,i9;   /* defined on [0,L_CODE-1] */
    Word32 s;

(The values of i0, i1, i2, i3, i4, i5, i6, and i7 were obtained before entering this loop.)

Original C code

for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
{
    s = L_mult (rr[i9][i9], _1_16);
    s = L_mac (s, rr[i0][i9], _1_8);
    s = L_mac (s, rr[i1][i9], _1_8);
    s = L_mac (s, rr[i2][i9], _1_8);
    s = L_mac (s, rr[i3][i9], _1_8);
    s = L_mac (s, rr[i4][i9], _1_8);
    s = L_mac (s, rr[i5][i9], _1_8);
    s = L_mac (s, rr[i6][i9], _1_8);
    s = L_mac (s, rr[i7][i9], _1_8);
    rrv[i9] = round (s);
}
where L_mult(a,b)  = _smpy(a,b)
      L_mac(a,b,c) = _sadd(a,_smpy(b,c))
and   round(a)     = _sadd(a,0x8000L)>>16


The instructions for one loop iteration are shown in Example 5-23.


Example 5-23. Instructions for One Loop Iteration


LOOP:
        LDH     .D    *rr9ptr++[205],rr99   ;load rr[i9][i9]
        SMPY    .M    rr99,_1_16,s          ;s=L_mult(rr[i9][i9],_1_16)
        LDH     .D    *rr0ptr++[5],rr09     ;load rr[i0][i9]
        SMPY    .M    rr09,_1_8,s0          ;L_mult(rr[i0][i9],_1_8)
        SADD    .L    s,s0,s                ;s=L_mac(s,rr[i0][i9],_1_8)
        LDH     .D    *rr1ptr++[5],rr19     ;load rr[i1][i9]
        SMPY    .M    rr19,_1_8,s1          ;L_mult(rr[i1][i9],_1_8)
        SADD    .L    s,s1,s                ;s=L_mac(s,rr[i1][i9],_1_8)
        LDH     .D    *rr2ptr++[5],rr29     ;load rr[i2][i9]
        SMPY    .M    rr29,_1_8,s2          ;L_mult(rr[i2][i9],_1_8)
        SADD    .L    s,s2,s                ;s=L_mac(s,rr[i2][i9],_1_8)
        LDH     .D    *rr3ptr++[5],rr39     ;load rr[i3][i9]
        SMPY    .M    rr39,_1_8,s3          ;L_mult(rr[i3][i9],_1_8)
        SADD    .L    s,s3,s                ;s=L_mac(s,rr[i3][i9],_1_8)
        LDH     .D    *rr4ptr++[5],rr49     ;load rr[i4][i9]
        SMPY    .M    rr49,_1_8,s4          ;L_mult(rr[i4][i9],_1_8)
        SADD    .L    s,s4,s                ;s=L_mac(s,rr[i4][i9],_1_8)
        LDH     .D    *rr5ptr++[5],rr59     ;load rr[i5][i9]
        SMPY    .M    rr59,_1_8,s5          ;L_mult(rr[i5][i9],_1_8)
        SADD    .L    s,s5,s                ;s=L_mac(s,rr[i5][i9],_1_8)
        LDH     .D    *rr6ptr++[5],rr69     ;load rr[i6][i9]
        SMPY    .M    rr69,_1_8,s6          ;L_mult(rr[i6][i9],_1_8)
        SADD    .L    s,s6,s                ;s=L_mac(s,rr[i6][i9],_1_8)
        LDH     .D    *rr7ptr++[5],rr79     ;load rr[i7][i9]
        SMPY    .M    rr79,_1_8,s7          ;L_mult(rr[i7][i9],_1_8)
        SADD    .L    s,s7,s                ;s=L_mac(s,rr[i7][i9],_1_8)
        SADD    .L    s,0x8000L,sround      ;round(s)
        SHR     .S    sround,16,rrv9        ;rrv[i9]
        STH     .D    rrv9,*rrv9ptr++[5]    ;store rrv[i9]
[icntr] SUB     .ALU  icntr,1,icntr         ;decrement inner loop counter
[icntr] B       .S    LOOP                  ;branch to loop


In Example 5-23, rr9ptr, rr0ptr, rr1ptr, rr2ptr, rr3ptr, rr4ptr, rr5ptr, rr6ptr, rr7ptr,
and rrv9ptr are the pointers for rr[i9][i9], rr[i0][i9], rr[i1][i9], rr[i2][i9], rr[i3][i9],
rr[i4][i9], rr[i5][i9], rr[i6][i9], rr[i7][i9], and rrv[i9].


The .D unit (the unit used the most) is used ten times per iteration. Although
these instructions can be arranged in five cycles, any combination of the loads
hits the same memory bank, since any two values loaded are a multiple of
40 half-words apart; it still takes ten cycles for one rrv.
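

The bank argument can be checked with a toy model of the data memory. The sketch below is an assumption-laden illustration, not a description of any particular device: it supposes halfword-wide banks interleaved on consecutive halfword addresses, NUM_BANKS is set to 4 purely for the example, and the function names are hypothetical. Consult the device documentation for the real bank organization.

```c
#include <stddef.h>

#define NUM_BANKS 4     /* assumed bank count, for illustration only */
#define L_CODE    40

/* Bank touched by the halfword at the given halfword index,
   assuming simple halfword interleaving across the banks. */
static unsigned bank_of(size_t halfword_index)
{
    return (unsigned)(halfword_index % NUM_BANKS);
}

/* Halfword index of rr[row][col] relative to &rr[0][0]. */
static size_t rr_index(size_t row, size_t col)
{
    return row * L_CODE + col;
}

/* Nonzero when two halfword accesses would land in the same bank. */
static int same_bank(size_t a, size_t b)
{
    return bank_of(a) == bank_of(b);
}
```

Under this model, two elements in the same column of rr are separated by a multiple of 40 halfwords and therefore always share a bank, whereas moving one of them five columns over changes the separation to a value that is not a multiple of the bank count.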
Next, consider unrolling the inner loop once. The C code becomes as shown
in Example 5-24.


Example 5-24. C Code for Unrolled Loop


for (i9 = ipos[9]; i9 < L_CODE; i9 += 2*STEP)
{
    s = L_mult (rr[i9][i9], _1_16);
    S = L_mult (rr[i9+5][i9+5], _1_16);
    s = L_mac (s, rr[i0][i9], _1_8);
    S = L_mac (S, rr[i0][i9+5], _1_8);
    s = L_mac (s, rr[i1][i9], _1_8);
    S = L_mac (S, rr[i1][i9+5], _1_8);
    s = L_mac (s, rr[i2][i9], _1_8);
    S = L_mac (S, rr[i2][i9+5], _1_8);
    s = L_mac (s, rr[i3][i9], _1_8);
    S = L_mac (S, rr[i3][i9+5], _1_8);
    s = L_mac (s, rr[i4][i9], _1_8);
    S = L_mac (S, rr[i4][i9+5], _1_8);
    s = L_mac (s, rr[i5][i9], _1_8);
    S = L_mac (S, rr[i5][i9+5], _1_8);
    s = L_mac (s, rr[i6][i9], _1_8);
    S = L_mac (S, rr[i6][i9+5], _1_8);
    s = L_mac (s, rr[i7][i9], _1_8);
    S = L_mac (S, rr[i7][i9+5], _1_8);
    rrv[i9] = round (s);
    rrv[i9+5] = round (S);
}


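Example 5-25 shows the instructions for each iteration.


Example 5-25. Instructions for One Iteration of the Loop


LOOP:
        LDH     .D    *rr9ptr++[410],rr99       ;load rr[i9][i9]
        SMPY    .M    rr99,_1_16,s              ;s=L_mult(rr[i9][i9],_1_16)
        LDH     .D    *rr95ptr++[410],rr995     ;load rr[i9+5][i9+5]
        SMPY    .M    rr995,_1_16,S             ;S=L_mult(rr[i9+5][i9+5],_1_16)
        LDH     .D    *rr0ptr++[10],rr09        ;load rr[i0][i9]
        SMPY    .M    rr09,_1_8,s0              ;L_mult(rr[i0][i9],_1_8)
        SADD    .L    s,s0,s                    ;s=L_mac(s,rr[i0][i9],_1_8)
        LDH     .D    *rr05ptr++[10],rr095      ;load rr[i0][i9+5]
        SMPY    .M    rr095,_1_8,S0             ;L_mult(rr[i0][i9+5],_1_8)
        SADD    .L    S,S0,S                    ;S=L_mac(S,rr[i0][i9+5],_1_8)
        LDH     .D    *rr1ptr++[10],rr19        ;load rr[i1][i9]
        SMPY    .M    rr19,_1_8,s1              ;L_mult(rr[i1][i9],_1_8)
        SADD    .L    s,s1,s                    ;s=L_mac(s,rr[i1][i9],_1_8)
        LDH     .D    *rr15ptr++[10],rr195      ;load rr[i1][i9+5]
        SMPY    .M    rr195,_1_8,S1             ;L_mult(rr[i1][i9+5],_1_8)
        SADD    .L    S,S1,S                    ;S=L_mac(S,rr[i1][i9+5],_1_8)
        LDH     .D    *rr2ptr++[10],rr29        ;load rr[i2][i9]
        SMPY    .M    rr29,_1_8,s2              ;L_mult(rr[i2][i9],_1_8)
        SADD    .L    s,s2,s                    ;s=L_mac(s,rr[i2][i9],_1_8)
        LDH     .D    *rr25ptr++[10],rr295      ;load rr[i2][i9+5]
        SMPY    .M    rr295,_1_8,S2             ;L_mult(rr[i2][i9+5],_1_8)
        SADD    .L    S,S2,S                    ;S=L_mac(S,rr[i2][i9+5],_1_8)
        LDH     .D    *rr3ptr++[10],rr39        ;load rr[i3][i9]
        SMPY    .M    rr39,_1_8,s3              ;L_mult(rr[i3][i9],_1_8)
        SADD    .L    s,s3,s                    ;s=L_mac(s,rr[i3][i9],_1_8)
        LDH     .D    *rr35ptr++[10],rr395      ;load rr[i3][i9+5]
        SMPY    .M    rr395,_1_8,S3             ;L_mult(rr[i3][i9+5],_1_8)
        SADD    .L    S,S3,S                    ;S=L_mac(S,rr[i3][i9+5],_1_8)
        LDH     .D    *rr4ptr++[10],rr49        ;load rr[i4][i9]
        SMPY    .M    rr49,_1_8,s4              ;L_mult(rr[i4][i9],_1_8)
        SADD    .L    s,s4,s                    ;s=L_mac(s,rr[i4][i9],_1_8)
        LDH     .D    *rr45ptr++[10],rr495      ;load rr[i4][i9+5]
        SMPY    .M    rr495,_1_8,S4             ;L_mult(rr[i4][i9+5],_1_8)
        SADD    .L    S,S4,S                    ;S=L_mac(S,rr[i4][i9+5],_1_8)
        LDH     .D    *rr5ptr++[10],rr59        ;load rr[i5][i9]
        SMPY    .M    rr59,_1_8,s5              ;L_mult(rr[i5][i9],_1_8)
        SADD    .L    s,s5,s                    ;s=L_mac(s,rr[i5][i9],_1_8)
        LDH     .D    *rr55ptr++[10],rr595      ;load rr[i5][i9+5]
        SMPY    .M    rr595,_1_8,S5             ;L_mult(rr[i5][i9+5],_1_8)
        SADD    .L    S,S5,S                    ;S=L_mac(S,rr[i5][i9+5],_1_8)
        LDH     .D    *rr6ptr++[10],rr69        ;load rr[i6][i9]
        SMPY    .M    rr69,_1_8,s6              ;L_mult(rr[i6][i9],_1_8)
        SADD    .L    s,s6,s                    ;s=L_mac(s,rr[i6][i9],_1_8)
        LDH     .D    *rr65ptr++[10],rr695      ;load rr[i6][i9+5]
        SMPY    .M    rr695,_1_8,S6             ;L_mult(rr[i6][i9+5],_1_8)
        SADD    .L    S,S6,S                    ;S=L_mac(S,rr[i6][i9+5],_1_8)
        LDH     .D    *rr7ptr++[10],rr79        ;load rr[i7][i9]
        SMPY    .M    rr79,_1_8,s7              ;L_mult(rr[i7][i9],_1_8)
        SADD    .L    s,s7,s                    ;s=L_mac(s,rr[i7][i9],_1_8)
        LDH     .D    *rr75ptr++[10],rr795      ;load rr[i7][i9+5]
        SMPY    .M    rr795,_1_8,S7             ;L_mult(rr[i7][i9+5],_1_8)
        SADD    .L    S,S7,S                    ;S=L_mac(S,rr[i7][i9+5],_1_8)
        SADD    .L    s,0x8000L,sround          ;round(s)
        SHR     .S    sround,16,rrv9            ;rrv[i9]
        STH     .D    rrv9,*rrv9ptr++[10]       ;store rrv[i9]
        SADD    .L    S,0x8000L,Sround          ;round(S)
        SHR     .S    Sround,16,rrv95           ;rrv[i9+5]
        STH     .D    rrv95,*rrv95ptr++[10]     ;store rrv[i9+5]
[icntr] SUB     .ALU  icntr,2,icntr             ;decrement inner loop counter
[icntr] B       .S    LOOP                      ;branch to loop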


In Example 5-25, rr9ptr and rr95ptr are the pointers for rr[i9][i9] and
rr[i9+5][i9+5]. rrixptr and rrix5ptr are the pointers for rr[ix][i9] and rr[ix][i9+5]
(where ix = i0, i1, ..., i7). rrv9ptr and rrv95ptr are the pointers for rrv[i9] and
rrv[i9+5], respectively. Again, .D is the unit used the most (twenty times per
iteration).


Notice that any pair rr[ix][i9], rr[iy][i9+5] never hits the same memory bank.
The same is true for the pair rrv[i9], rrv[i9+5], as well as for the pair
rr[i9][i9], rr[i9+5][i9+5]. For ease of understanding:


- Load rr[ix][i9], rr[ix][i9+5] together
- Load rr[i9][i9], rr[i9+5][i9+5] together
- Store rrv[i9], rrv[i9+5] together


In this way, each iteration takes ten cycles, without any memory bank hits.
Because each iteration now produces two rrv values, you double the speed by
unrolling the loop once.


The final assembly code is shown in Example 5-26.


Example 5-26. Final Assembly Code of rrv Computation


Texas Instruments, Inc 


Implementation of the rrv Computation in search_10i40 in EFR 


Compute two rrvs a time 


Total cycles 


55 


Register Usage: 


sol, 
~S2 


~S2 


M2 
~S1X 
~L2 
~S2X 


M2 
~L2X 
2ou 


M2 
-L1X 
~L2 
S11 
~S2X 
M1 


A 


16 


14 


7; B4 = 4) 

po BS caper elk 

; Bé = 12 

e BT ne es: 

; A8 - i4 

- Bo Eauee ee 

Py ALO. -= 16 

POAT SS] 

BSH 2S 

, A1lS --- &rr[0] [0] 

; AO --- &rrv[0] 

; B14 --- stack pointer 
410,A2 ; offset of rr[i9] [i9] 
410,B2 ; offset of rr[i9+5] [i9+5] 
82,B0 
B3,B0,B3 * Torr] 
B3,1,A13 
BO,2,BO0 ; 80 
A15,B2,B13 ¢ &cei 5) LS] 
B4,B0,B4 , [10] [0 
A15,10,B15 5 eee OVS) 
80,Al 
B5,B0,B5 £ ETI 
B3,A15,A3 ; &xrxr[i9] [19] 
B3,B13,B3 ; &xrr[i9t+5] [i9+5] 
A15,A13,A15 ; &rxr[0] [i9] 
B15,A13,B15 ; &rxr[0] [i9+5] 
A10,A1,A10 ; [16] [0 




PYU 
PYU 


PYU 
PYU 


PYU 


B6,B0,B6 
Al1,A1,A11 
B4,A15,A4 
B4,B15,B4 
*A3++[A2],A13 
*B3++[B2],B13 
AO,A13,A0 


B7,B0,B7 
A8,A1,A8 
B5,A15,A5 
BS,B1S,B5 
A10,A15,A10 
A10,B15,B10 
*A4++[10],A13 
*B4++[10],B13 


B9,BO,B9 
B6,A15,A6 
B6,B15,B6 
Al11,A15,A11 
Al1,B15,B11 
*A5++[10],A13 
*B5++[10],B13 


B7,A15,A7 
B7,B15,B7 
A8,A15,A8 
A8,B15,B8 
*A64++[10],A13 
*B6++[10],B13 


B9,A15,A9 
B9,B15,B9 
*A7++[10],A13 
*B7,B13 
2048,B7 


*A8++[10],A13 
*B8++[10],B13 
A13,B7,A12 
B13,B7,B12 
B7,1,B7 
AO,10,B0 


*A9++[10],A13 
*B9++[10],B13 
A13,B7,A15 
B13,B7,B15 


[i2] [0] 

[i7] [0] 

&rxr[id] [i9] 
&érr[i0] [19+5] 
load rr[i9][i9] 
load rr[i9+5] [i9+5] 
&rrv[i9] 

[i3] [0] 

[i4] [0] 

&rr[il] [i9] 
&rr[il] [i9+5] 
&rr[i6] [9] 
&rr[i6] [i9+5] 


load rr[i0] [i9] 
load rr[i0] [i9+5] 


[i9] [0] 
&rr[i2] [i9 
&érr[i2] [i9+5] 
&rxr[i7] [i9 
&rr[i7] [i9+5] 


Load rr[il] [19] 
load rr[il] [i9+5] 


&rxr[i3] [i9 
&rxr[i3] [i9+5] 
&rxr[i4] [i9 
&rxr[i4] [i9+5] 


load rr[i2][i9] 
load rr[i2] [i9+5] 


&rr [id] [i9 
&érr[i5] [19+5] 

Load rr [iS] [19 
load rr[i3] [i9+5] 
_1_16 


load rr[i4] [i9 
load rr[i4][i9+5] 

s=smpy (rr[i9] [i9],_1_16) 
S=smpy (rr[i9+5] [i9+5],_1_16) 
a8 
érrv[i19+5] 


load rr[i5] [i9 

load trlji5] [19+5] 

s0=smpy (rr[i0] [i9],_1_8) 
SO=smpy (rr[i0] [i9+5],_1_8) 


LDH D1 *A10++[10],A13 
LDH D2 *B10++[10],B13 
SMPY -M1LX A13,B7,A15 
SMPY .M2 BL3,B7,B15 
LDH Dd. *A11++[10],A13 
LDH D2 *B11++[10],B13 
SMPY 1x A13,B7,A15 
SMPY -M2 Bl3,B7,515 
SADD Para A12,A15,A12 
SADD L2 B12,B15,B12 
MVK si 3,Al 
SMPY -M1X A13,B7,A15 
SMPY M2 B13,B7,B15 
SADD L1 A12,A15,A12 
SADD L2 B12,B15,B12 
MVK si 32767,A14 
LOOP: 
SADD id: A12,A15,A12 
SADD L2 Big, BIS, B12 
SMPY 1X A13,B7,A15 
SMPY 2 B13,B7,;Bi5 
LDH D1 *A3++[A2],A13 
LDH D2 *B3++[B2],B13 
ADD s1 Al4,1,A14 
SMPY 1X A13,B7,A15 
SMP Y 2 B13,B7,B15 
SADD L1 A12,A15,A12 
SADD L2 B12,B15,B12 
LDH D1 *A4++[10],A13 
LDH D2 *B4,B13 
SMPY 1X A13,B7,A15 
SMPY 2 B13;B7,B15 
SADD L1 A12,A15,A12 
SADD L2 B12,B15,B12 
LDH D1 *A5++[10],A13 
LDH D2 *BS++[10],B13 
SMPY 1X A13,B7,A15 
SMPY 2 B1S,B7;B15 
SADD id: A12,A15,A12 
SADD L2 B12,B15,B12 
LDH D1 *A6++[10],A13 
LDH D2 *B6t+[ 10), BLS 
ADD ~S2X A7,10,B7 


, 
, 
, 
’ 


, 


load rr[i6] [i9 

load rr[i6] [i9+5] 

sl=smpy (rr[il] [i9],_1_8) 
Sl=smpy (rr[il] [i9+5],_1_8) 
load rr[i7] [i9 

load rr[i7] [i9+5] 

s2=smpy (rr[i2][i9],_1_8) 
S2=smpy (rr[i2] [i9+5],_1_8) 


s=sadd(s,s0) 
S=sadd(S,S0) 
loop counter 


s3=smpy (rr[i3] [i9],_1_8) 
S3=smpy (rr[i3] [i9+5],_1_8) 
s=sadd(s,s1) 


S=sadd(S,S1) 


s=sadd(s,s2) 

S=sadd(S,S2) 

s4=smpy (rr[i4] [i9],_1_8) 
S4=smpy (rr[i4] [i9+5],_1_8) 
* load rr[i9] [i9] 
* load rr[i9+5] [i19+5] 
32768 for rounding 


s5=smpy (rr[i5] [i9],_1_8) 
S5=smpy (rr[i5] [i9+5],_1_8) 
s=sadd(s,s3) 

S=sadd(S,S3) 

® Load er [20] (291 

* load rr[i0] [i9+5] 


s6é=smpy (rr[i6] [i9],_1_8) 
S6=smpy (rr[i6é] [i9+5],_1_8) 
s=sadd(s,s4) 

S=sadd(S,S4 
x Toad er [1. 


) 
][19] 
* load re[il] 


A 
1] [19+5] 

s7=smpy (rr[i7] [i9],_1_8) 
S7=smpy (rr[i7] [i9+5],_1_8) 
s=sadd(s,s5) 

S=sadd(S,S5) 


;* load rr[i2] [i9] 
;* load rr[i2] [i9+5] 


&rxr[i3] [19+5] 


A12,A15,A12 
B12,B15,B12 
*A7++[10],A13 
*B7,B13 
2048,B7 

LOOP 


A12,A15,A12 
B12,B15,B12 
*A8++[10],A13 
*B8++[10],B13 
A13,B7,A12 
B13,B7,B12 
B7,1,B7 
Al,1,Al1 


A12,A14,A14 
B12,A14,B4 
*A9++[10],A13 
*B9++[10],B13 
A13,B7,A15 
B13,B7,B15 


A14,16,A14 
B4,16,B4 
A13,B7,A15 
B13,B7,B15 
*A10++[10],A13 
*B10++[10],B13 


A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
*A11++[10],A13 
*B11++[10],B13 


Al14,*A0++[10] 
B4, *B0++[10] 
A13,B7,A15 
B13,B7,B15 
A12,A15,A12 
B12,B15,B12 
A4,10,B4 
32767,A14 


’ 


’ 


s=sadd(s,s6 
S=sadd(S,S6 
* load rr[i3 
* Load er [is 
_1_16 


[i9] 
[i9+5] 


branch to the loop 


s=sadd(s,s7) 
S=sadd(S,8S7) 


load rr[i4] [i9] 
* load rr[i4] [i9+5] 
s=smpy (rr[i9] [i9],_1_16) 
* S=smpy (rr[i9+5] [i9+5],_1_16) 


a8 


decrement loop counter 


+ + 
a 
° 0 
» © 
aa 
KOK 
KOR 


rrv[i9] 
rrv[i9+5] 
sl=smpy (rr 

* Sl=smpy (rr 
load rr[ié] [ 

* load rr[i6é 

* s2=smpy (rr[i 
S2=smpy (rr[i 

* s=sadd(s,s0) 

* S=sadd(S,S0) 

* Load “er [ay 
load rr[i7 


store rrv[i9] 
store rrv[i9t 


19] 

19+5] 

0) [i9],_1 
OP [TIS] > 
1.) [2 9) pad 
1) [e9+54 7 
19: 

[i9+5] 

2) [19 )}y21 
2) [19+5], 
i9 

19+5] 

5] 


8) 
1_8) 


;* s3=smpy (rr[i3] [i9],_1_8) 
* S3=smpy (rr[i3] [i9+5],_1_8) 


, 


* s=sadd(s,s1) 

S=sadd(S,S1) 
* €rr (10) [i9ss 
end of LOOPX 


] 
Notice that because of the shortage of registers:


- B7 serves as _1_16, _1_8, and as the pointer for rr[i3][i9+5]
- B4 is both the value of rrv[i9+5] and the pointer of rr[i0][i9+5]
- A14 represents 0x8000L as well as rrv[i9]


The last iteration of the loop could be expanded as the epilog of the loop to
overlap with the prolog of the code following this part of the code.


5.2.5 Implementation of the Index Search In search_10i40


The index search in search_10i40 is the core of search_10i40. The C code is
shown in Example 5-27.


Example 5-27. Index Search for search_10i40 C Code


#define L_CODE 40
#define STEP   5
#define _1_16  (Word16)(32768L/16)
#define _1_8   (Word16)(32768L/8)

input:
    Word16 rr[L_CODE][L_CODE], ipos[L_CODE], dn[L_CODE];

local variables/arrays:
    Word16 rrv[L_CODE];
    Word16 i0,i1,i2,i3,i4,i5,i6,i7,i8,i9;   /* defined on [0,L_CODE-1] */
    Word16 ia,ib;
    Word16 ps,ps0,ps1,ps2,sq,sq2;
    Word16 alp,alp_16;
    Word32 s,alp0,alp1,alp2;

(Notice that the values of i0, i1, i2, i3, i4, i5, i6, i7, ps0, and alp0 have
been obtained before entering this loop.)

Original C code

sq = -1;
alp = 1;
ps = 0;
ib = ipos[9];

/* initialize 10 indices for i8 loop (see i2-i3 loop) */
for (i8 = ipos[8]; i8 < L_CODE; i8 += STEP)
{
    ps1 = add (ps0, dn[i8]);
    alp1 = L_mac (alp0, rr[i8][i8], _1_128);
    alp1 = L_mac (alp1, rr[i0][i8], _1_64);
    alp1 = L_mac (alp1, rr[i1][i8], _1_64);
    alp1 = L_mac (alp1, rr[i2][i8], _1_64);
    alp1 = L_mac (alp1, rr[i3][i8], _1_64);
    alp1 = L_mac (alp1, rr[i4][i8], _1_64);
    alp1 = L_mac (alp1, rr[i5][i8], _1_64);
    alp1 = L_mac (alp1, rr[i6][i8], _1_64);
    alp1 = L_mac (alp1, rr[i7][i8], _1_64);
    /* initialize 3 indices for i9 inner loop (see i2-i3 loop) */
    for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
    {
        ps2 = add (ps1, dn[i9]);
        alp2 = L_mac (alp1, rrv[i9], _1_8);
        alp2 = L_mac (alp2, rr[i8][i9], _1_64);
        sq2 = mult (ps2, ps2);
        alp_16 = round (alp2);
        s = L_msu (L_mult (alp, sq2), sq, alp_16);
        if (s > 0)
        {
            sq = sq2;
            ps = ps2;
            alp = alp_16;
            ia = i8;
            ib = i9;
        }
    }
}
where add(a,b)     = _sadd(a<<16,b<<16)>>16
      L_mac(a,b,c) = _sadd(a,_smpy(b,c))
      mult(a,b)    = _smpy(a,b)>>16
      L_mult(a,b)  = _smpy(a,b)
      round(a)     = _sadd(a,0x8000L)>>16
and   L_msu(a,b,c) = _ssub(a,_smpy(b,c))


This is a typical example of the performance being limited by data dependency
constraints. In this case, the dependency is on the values of alp and sq.


5.2.5.1 Rearranging the C Code


To avoid the unnecessary shifts, ps, ps1, ps2, alp, alp_16, sq, and sq2 are
implemented as Word32 variables. The affected calculations are:


Original                          Implemented as

ps1 = add (ps0, dn[i8]);          ps1 = sadd(ps0, dn[i8]<<16);
ps2 = add (ps1, dn[i9]);          ps2 = sadd(ps1, dn[i9]<<16);
sq2 = mult (ps2, ps2);            sq2 = smpyh(ps2,ps2);
alp_16 = round(alp2);             alp_16 = sadd(alp2,0x8000L);


There is no need to compute s explicitly. Instead of implementing the following:

s = L_msu (L_mult (alp, sq2), sq, alp_16);
if (s > 0)
{
    sq = sq2;
    ps = ps2;
    alp = alp_16;
    ia = i8;
    ib = i9;
}

you can do the following to fulfil the same task:

if (_smpyh(alp,sq2) > _smpyh(sq,alp_16)) {
    sq = sq2;
    ps = ps2;
    alp = alp_16;
    ia = i8;
    ib = i9;
}
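

One way to read the rewritten test is as a comparison of two ratios by cross-multiplication: a candidate replaces the current best when sq2/alp_16 exceeds sq/alp, which avoids both the division and the explicit difference s. The sketch below illustrates that reading with plain integer arithmetic. The function name is hypothetical, both denominators are assumed to be positive, and the doubling and saturation of the saturating multiplies are left out.

```c
#include <stdint.h>

/* Returns 1 when num2/den2 > num1/den1 for positive denominators,
   decided by cross-multiplication instead of division. */
static int ratio_improves(int32_t num1, int32_t den1,
                          int32_t num2, int32_t den2)
{
    /* 64-bit products so the comparison itself cannot overflow. */
    return (int64_t)den1 * num2 > (int64_t)num1 * den2;
}
```

Here num1 and den1 play the roles of sq and alp, and num2 and den2 the roles of sq2 and alp_16. With the initial values sq = -1 and alp = 1, any candidate whose sq2 is not negative is accepted, which is consistent with the initialization in the C code.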


5.2.5.2 Performance Analysis


The instructions to execute one iteration of the inner loop are shown in
Example 5-28.


INNERLOOP:
        LDH     .D    *dn9ptr++[5],dn9      ; load dn[i9]
        SHL     .S    dn9,16,dn9h           ; dn[i9] << 16
        SADD    .L    ps1,dn9h,ps2          ; ps2 = sadd(ps1, dn[i9] << 16)
        SMPYH   .M    ps2,ps2,sq2           ; sq2 = smpyh(ps2,ps2)
        LDH     .D    *rrvptr++[5],rrv      ; load rrv[i9]
        SMPY    .M    rrv,_1_8,tmp1         ; smpy(rrv[i9], _1_8)
        SADD    .L    alp1,tmp1,alp2        ; alp2=sadd(alp1,smpy(rrv[i9],_1_8))
        LDH     .D    *rr89ptr++[5],rr89    ; load rr[i8][i9]
        SMPY    .M    rr89,_1_64,tmp2       ; smpy(rr[i8][i9],_1_64)
        SADD    .L    alp2,tmp2,alp2        ; alp2=sadd(alp2,smpy(rr[i8][i9],_1_64))
        SADD    .L    alp2,0x8000L,alp_16   ; alp_16=sadd(alp2,0x8000L)
        SMPYH   .M    alp,sq2,tmp3          ; smpyh(alp,sq2)
        SMPYH   .M    sq,alp_16,tmp4        ; smpyh(sq,alp_16)
        CMPGT   .L    tmp3,tmp4,cndr        ; if(smpyh(alp,sq2) > smpyh(sq,alp_16))
[cndr]  MV      .ALU  sq2,sq
[cndr]  MV      .ALU  ps2,ps
[cndr]  MV      .ALU  alp_16,alp
[cndr]  MV      .ALU  i8,ia
[cndr]  MV      .ALU  i9,ib
[icntr] SUB     .ALU  icntr,1,icntr
[icntr] B       .S    INNERLOOP             ; branch to the loop


Since both sq and alp are carried over and required from one iteration to the
next, their values should be kept in registers for speed. You may notice that at
least four cycles are required to compute a new sq and alp, and that the
requirement on the functional units does not exceed four execute packets.
Therefore, the inner loop can be executed in four cycles per iteration.


For the outer loop, any pair of rr[ix][i8], rr[iy][i8] (where ix, iy = i0, i1, ..., i7) will
definitely hit the same memory bank if they are read together. Therefore, they
should be loaded in one cycle each.


5.2.5.3 Partitioning the Registers


The total number of registers required for this code, including the registers for
the pointers of the arrays, loop counters, intermediate results, etc., exceeds the
number of registers available. To partition the registers without losing speed,
the strategies are:


- For the inner loop, store the results of ps, ia, and ib, whose values are not
  used in this code.


- For the outer loop, store the pointers of arrays starting at rr[i5][i8], rr[i6][i8],
  and rr[i7][i8], whose values are needed last in the outer loop.
Assume that before entering this code, &dn[0], &ipos[0], &rr[0][0], &rrv[0],
i0, i1, i2, i3, i4, i5, i6, i7, ps0, and alp0 are known. Assume that the Word16
integers are stored in the stack in the order i0, i1, i2, i3, i4, i5, i6, i7, ia, and ib,
and that a pointer &local_16[0], pointing to i0, is also known. The Word32
integers and the pointers of the rr arrays are stored in the stack in the order
ps0, ps, alp0, alp1, &rr[i5][i8], &rr[i6][i8], and &rr[i7][i8]. The pointer
&local_32[0], pointing to ps0, is known as well.


The C code is shown in Example 5-29.


Example 5-29. The Index Search Modified C Code


sq = -1;
alp = 1;
local_32[1] = 0;
local_16[8] = ipos[8];
local_16[9] = ipos[9];

/* initialize 10 indices for i8 loop (see i2-i3 loop) */
for (i8 = ipos[8]; i8 < L_CODE; i8 += STEP)
{
    ps1 = _sadd(local_32[0], dn[i8]<<16);
    local_32[3] = _sadd(local_32[2], _smpy(rr[i8][i8], _1_128));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i0][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i1][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i2][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i3][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i4][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i5][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i6][i8], _1_64));
    local_32[3] = _sadd(local_32[3], _smpy(rr[i7][i8], _1_64));

    /* initialize 3 indices for i9 inner loop (see i2-i3 loop) */
    for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
    {
        ps2 = _sadd(ps1, dn[i9]<<16);
        alp2 = _sadd(local_32[3], _smpy(rrv[i9], _1_8));
        alp2 = _sadd(alp2, _smpy(rr[i8][i9], _1_64));
        sq2 = _smpyh(ps2, ps2);
        alp_16 = _sadd(alp2,0x8000L);
        if (_smpyh(alp,sq2) > _smpyh(sq,alp_16))
        {
            sq = sq2;
            local_32[1] = ps2;
            alp = alp_16;
            local_16[8] = i8;
            local_16[9] = i9;
        }
    }
}


5.2.5.4 Final Assembly Code


The final code consists of the following steps:


Step 1: Load i0, i1, ... i9, alp0, and ps0; and initialize sq, ia, and ib. Part of
        the code overlaps that of the last iteration of the code in Section
        5.2.3 on page 5-20.


Step 2: Obtain the pointers for the arrays starting at rr[i0][i8], rr[i1][i8], ...
        rr[i7][i8], rr[i8][i9], rrv[i9], dn[i8], and dn[i9].


Step 3: Load rr[i0][i8], rr[i1][i8], ... rr[i7][i8] and dn[i8], compute the new ps1
        and alp1, update the pointers and store pointers &rr[i5][i8],
        &rr[i6][i8], and &rr[i7][i8].


Step 4: Load rr[i8][i9], rrv[i9], and dn[i9]. Compute alp2, ps2, alp_16, sq2
        and perform a comparison. Update the parameters ia, ib, alp, sq,
        and ps based on the comparison result. Repeat this step eight
        times.


Step 5: Reload the values of ps0 and alp0, and &rr[i5][i8], &rr[i6][i8], and
        &rr[i7][i8]. Verify that Step 3 has been repeated eight times. If not,
        go to Step 3. If yes, exit.


To avoid memory bank hits, arrays rr and rrv must not be aligned on the same
word or half-word boundary. The same applies to arrays rr and dn. As you can
see in the final assembly code shown in Example 5-30, there are several
places where LDH (or STH) and LDW (or STW) occur in the same execute
packet. They belong to one of two categories. The first category always loads
values from or stores values to the same memory locations, as in the
following:

           LDW     .D1   *+A6[3],A11       ; load alp1
||  [B2]   STH     .D2   B13,*+B6[9]       ; store ib=i9


The second category pairs an inner loop access with an outer loop access
whose memory location changes from one iteration to the next, as in the
following:

    [B2]   STW     .D1   B11,*+A6[1]       ; store ps
||         LDH     .D2   *B10++[5],A5      ; load rr[i5][i8]


In the former case, the memory bank hits can be completely eliminated by
allocating the corresponding arrays in memory properly. In the latter case,
however, memory bank hits occur in every other iteration. Although, in general,
you should avoid writing such code, the performance of the prolog of the outer
loop after the first iteration is limited by the .D unit in this case. You still save
some cycle counts compared to not doing so.


Notice that the last three iterations of the inner loop are overlapped with part
of the prolog of the outer loop to improve the performance.


Example 5-30. Final Assembly Code of the search_10i40 Index Search 


Texas Instruments, Inc

Implementation of The Index Search in search_10i40 in EFR

Total cycles = 400 (among the 400 cycles, 10 cycles are caused
by memory bank hits)


Register Usage: A B

alts) 15 olka! 




-D1 


-D1 
-D2 
~S1X 


K* 


; A13 --- &ipos[0] and alp
; B6  --- &local_16[0]
; A6  --- stack pointer, point to &local_32[0]
; B8  --- &rr[0][0]
; A4  --- &rrv[0]
; B14 --- &dn[0]
; B1  --- reserved for the counter of the
;         outmost loop in search_10i40
*+A13[8],A7 ; load i8 = ipos[8] 
*+A13[9],B13 ; load i9 = ipos[9] 
*B6,A13 ; load i0 
B6o,A5 ; &local_v16[0] 


Ses Pprpre 


Ss Po 


PYU 
BYU 


PYU 


PYU 
PYU 


D2 
-D1 
sl 


*+B6[2],B9 
*+A5[1],A14 
0,A8 


*+A5[4],A15 
*+B6[3],B10 
80,A0 
80,B0 


A8, *+A6[1] 
*+B6[5],B11 
A7,1,B10 
A7,A0,A12 


A7,*+A5[8] 
B13, *+B6[9] 
B8,B10,B2 
A13,B0,B3 


*A6,B15 
*4+B6[6],Al 
Al2,B2,A12 
B14,B10,B7 
B8,A12,B8 
Al4,A0,A14 
B9,BO,B9 


*+A6[2],Al11 
*+B6[7],B5 
B13,B13,B12 
B3,B2,B3 


*A12,A5 
*B7++[5],B12 
B14,B12,B14 
Al4,B2,A14 


*B3++[5],A5 
B9,B2,B9 
B10,A0,A9 


*A144++[5],A5 
A4,B12,A4 
A15,A0,A15 
B11,B0,Bll 


, 


’ 


load i2 

load il 

could insert two .D 

units here for the store 

of rrv[i9+30] and rrv[i9+35] 
in the code which this piece 
immediately follows 


load i4 
load i3 


ps=0 

load i5 
[0] [18] 
[i8] [0] 


store ia=i8 
store ib=i9 
&rr[0] [i8] 
[i0] [0] 


load ps0 

load i6 

érr[i8] [i8] 
&dn [i 
érr[i 
[il] [ 
[i2] [ 


8 
8] [0] 
0 
0 


load alp0d 
load i7 
[0] [i9 
&rr [id] [i8 


load rr[i8][i8] 
load dn[i8 
&dn[i9] 
&rr[il] [i8 


load rr[i0] [i8] 
&rr[i2] [i8 
[i3] [0] 


load rr[il][i8] 
&rrv[i9 
[i4] [0] 
[i5] [0] 


SADD 
SMPY 


OUTERLOOP: 


LDH 
LDH 
SADD 
SUB 
SMPY 


LDH 
SADD 
SMPY 


*B9++[5],A5 
256,A0 
A9,B2,A9 
Al,A0,Al 


*A9++[5],B12 
B11,B2,B10 
7,A2 

512,B0 
A15,B2,A15 
B8,B12,B4 
B5,B0,B5 


*A15++[5],A5 
AO,1,A0 


B12,16,Bll 
Al,B2,Al 
A5,A0,A8 


*B10++[5],A5 
-1,A3 
A5,A0,A8 


*A1++[5],B12 
B5,B2,B1l 
AO,7,A13 
Al1,A8,All 
B15,B11,B15 
A5,A0,A8 


*B11++[5],A5 
A11,A8,A11 
A5,A0,A8 


*A4++[5],A5 
*B4++[5],B12 
A11,A8,A11 
B13,5,B13 
B12,A0,A8 


*B14++[5],B12 
A11,A8,A11 
A5,A0,A8 


load rr[i2 
A0=_1_128 
&érr[i3] [i8 
[i6] [0 


load rr[i3 
érr[i5] [i8 
outer loop 
BO=_1_64 
érr[i4] [i8 
&érr[i8] [i9 
[i7] [0 


load rr[i4 
_1_64 


dn[i8] << 1 
&rr[i6] [i8 
smpy (rr[i8 


load rr[i5d 
sq=-1 
smpy (rr[i0 


load rr[i6 
&érr[i7] [i8 
alp=0x10000 


[i8] 


counter 


[i8] 


6 


i8],_1_128) 


i8] 


i8],_1_64) 


i8] 


alpl=sadd(alp0, smpy (rr[i8] [i8],_1_128) ) 


psl 
smpy (rr[il] 


load rr[i7] 
alpl=sadd(a 
smpy (rr[i2] 


load rrv[i9 
load rr[i8 
alpl=sadd(a 


smpy (rr[i3 
load dn[i9 


alpl=sadd(a 
smpy (rr[i4 


i8],_1_64) 


i8] 
lp1, smpy (rr[i0] [i8],_1_64)) 
i8],_1_64) 


i9] 
lpl, smpy (rr[il] [i8],_1_64) ) 


i8],_1_64) 


lp1, smpy (rr[i2] [i8],_1_64)) 
i8],_1_64) 


         STW    .D1   B10,*+A6[4]    ; store &rr[i5][i8+5]
||       SADD   .L1   A11,A8,A11     ; alp1=sadd(alp1,smpy(rr[i3][i8],_1_64))
||       SMPY   .M1   A5,A0,A8       ; smpy(rr[i5][i8],_1_64)

         STW    .D1   A1,*+A6[5]     ; store &rr[i6][i8+5]
||       SHL    .S1   A0,6,A10       ; 0x8000L
||       SADD   .L1   A11,A8,A11     ; alp1=sadd(alp1,smpy(rr[i4][i8],_1_64))
||       SMPY   .M1X  B12,A0,A8      ; smpy(rr[i6][i8],_1_64)

         LDH    .D1   *A4++[5],A5    ;* load rrv[i9]
||       LDH    .D2   *B4++[5],B12   ;* load rr[i8][i9]
||       SHL    .S1   A0,3,A0        ; A0=_1_8
||       SADD   .L1   A11,A8,A11     ; alp1=sadd(alp1,smpy(rr[i5][i8],_1_64))
||       SMPY   .M1   A5,A0,A8       ; smpy(rr[i7][i8],_1_64)

         LDH    .D2   *B14++[5],B12  ;* load dn[i9]
||       SADD   .L1   A11,A8,A11     ; alp1=sadd(alp1,smpy(rr[i6][i8],_1_64))
||       SMPY   .M1   A5,A0,A5       ; smpy(rrv[i9],_1_8)
||       SMPY   .M2   B12,B0,B12     ; smpy(rr[i8][i9],_1_64)

         STW    .D1   B11,*+A6[6]    ; store &rr[i7][i8+5]
||       SHL    .S2   B12,16,B11     ; dn[i9] << 16
||       SADD   .L1   A11,A8,A11     ; done alp1=sadd(alp1,smpy(rr[i7][i8],_1_64))

         STW    .D1   A11,*+A6[3]    ; store alp1
||       SADD   .L1   A11,A5,A5      ; alp2=sadd(alp1,smpy(rrv[i9],_1_8))
||       SADD   .L2   B11,B15,B5     ; ps2=sadd(ps1,dn[i9]<<16)

         LDH    .D1   *A4++[5],A5    ;** load rrv[i9]
||       LDH    .D2   *B4++[5],B12   ;** load rr[i8][i9]
||       B      .S2   INNERLOOP      ; branch to the innerloop
||       SADD   .L1X  A5,B12,A11     ; alp2=sadd(alp2,smpy(rr[i8][i9],_1_64))
||       SMPYH  .M2   B5,B5,B8       ; sq2=smpyh(ps2,ps2)

         LDH    .D2   *B14++[5],B12  ;** load dn[i9]
||       MVK    .S1   4,A1           ; innerloop counter
||       MVK    .S2   0,B2
||       SADD   .L1   A11,A10,A8     ; alp_16 = sadd(alp2, 0x8000L)
||       SMPY   .M1   A5,A0,A5       ;* smpy(rrv[i9],_1_8)
||       SMPY   .M2   B12,B0,B12     ;* smpy(rr[i8][i9],_1_64)


INNERLOOP:
         LDW    .D1   *+A6[3],A11    ; load alp1
|| [B2]  STH    .D2   B13,*+B6[9]    ; store ib=i9
||       SHL    .S2   B12,16,B10     ;* dn[i9]<<16
||       ADD    .L2   B13,5,B13      ; i9=i9+STEP
||       SMPYH  .M1   A8,A3,A11      ; smpyh(alp_16,sq)
||       SMPYH  .M2X  B8,A13,B10     ; smpyh(alp,sq2)

   [B2]  STW    .D1   B11,*+A6[1]    ; store ps
|| [B2]  STH    .D2   A7,*+B6[8]     ; store ia = i8
||       MV     .S2   B5,B11
||       SADD   .L1   A11,A5,A5      ;* alp2=sadd(alp1,smpy(rrv[i9],_1_8))
||       SADD   .L2   B10,B15,B5     ;* ps2=sadd(ps1,dn[i9]<<16)

         LDH    .D1   *A4++[5],A5    ;*** load rrv[i9+10]
||       LDH    .D2   *B4++[5],B12   ;*** load rr[i8][i9+10]
|| [A1]  SUB    .S1   A1,1,A1        ; decrement innerloop counter
|| [A1]  B      .S2   INNERLOOP      ; branch to INNERLOOP
||       SADD   .L1X  A5,B12,A11     ;* alp2=sadd(alp2,smpy(rr[i8][i9],_1_64))
||       CMPGT  .L2X  B10,A11,B2     ; if smpyh(alp,sq2) > smpyh(alp_16,sq)
||       SMPYH  .M2   B5,B5,B8       ;* sq2=smpyh(ps2,ps2)

   [B2]  MV     .D1   A8,A13         ; alp=alp_16
||       LDH    .D2   *B14++[5],B12  ;*** load dn[i9+10]
|| [B2]  MV     .S1X  B8,A3          ; sq=sq2
||       SADD   .L1   A11,A10,A8     ;* alp_16=sadd(alp2,0x8000L)
||       SMPY   .M1   A5,A0,A5       ;*** A0 = _1_8
||       SMPY   .M2   B12,B0,B12     ;*** B0 = _1_64
                                     ; end of innerloop

   [B2]  STW    .D1   B11,*+A6[1]    ; store ps
|| [B2]  STH    .D2   A7,*+B6[8]     ; store ia = i8
||       SHL    .S2   B12,16,B10     ; dn[i9]<<16
||       MV     .L2   B5,B11         ; ps2
||       SMPYH  .M1   A8,A3,A11      ; smpyh(alp_16,sq)
||       SMPYH  .M2X  B8,A13,B10     ; smpyh(alp,sq2)

         LDW    .D1   *+A6[2],A11    ; load alp0
|| [B2]  STH    .D2   B13,*+B6[9]    ; store ib=i9
||       MV     .S2X  A6,B2          ; stack pointer
||       SADD   .L1   A11,A5,A5      ; alp2=sadd(alp2,smpy(rr[i8][i9],_1_64))
||       SADD   .L2   B10,B15,B5     ; ps2=sadd(ps1,dn[i9]<<16)


NDnNnH 


< 


ne PHnNnNe 


DK 
DD 


PGT 


DK 


PY 


-M2X 


-L2X 


*+A6[5],Al 
*B2,B15 
205,A0 
A5,B12,Al1l 
B10,A11,B2 
B5,B5,B8 


*+4+A12[A0],A5 
*B7++[5],B12 
A8,A13 
-90,B14 

B8,A3 


*+A6[4],B10 
*B3++[5],A5 
-90,A4 
A11,A10,A8 
B13,5,B13 


*A14++[5],A5 
B13, *+B6[9] 
A8,A3,A10 
B8,A13,B10 


256,A0 
*+A6[6],B11 
*B9++[5],A5 
OUTERLOOP 


*A9++[5],B12 
A7, *+B6[8] 
B13,5,B13 
B10,A10,BO0 


*A154++[5],A5 
B13, *+B6[9] 
AO,1,A0 
-35,B13 
A8,A13 
A5,A0,A 


&rr[i6] [18] 
load ps0 


alp2=sadd(alp2,smpy (rr[i8] [i9],_1_64) ) 
if smpyh(alp,sq2) > smpyh(alp_16,sq) 
sq2=smpyh (ps2, ps2) 


load rr[i8][i8] 
load dn[i8] 
alp=alp_16 
é&dn[i9] 

sq=sq2 


érr[i5] [i8 
load rr[id 
&rrv[i9] 


alp_l6=sadd(alp2, 0Ox8000L) 


] 
] [18] 


load rr[il] [i8] 
store ib=i9 
smpyh (alp_16,sq) 
smpyh (alp, sq2) 


ab 128 

&rr[i7] [i8] 

load rr[i2][i8] 
branch to OUTERLOOP 


load rr[i3][i8] 

store ia = i8 

update i9 

if smpyh(alp,sq2) > smpyh(alp_16,sq) 


load rr[i4][i8] 
store ib=i9 


_1_64 
update i9 
alp=alp_16 


smpy (rr[i8] [i8],_1_128) 


BO] 


-L2X 


.D1B11, *+A6[1] 


*B10++[5],A5 
B8,A3 
B12,16,B11 
A2,1,A2 
A5,A0,A8 


*A1++[5],B12 
AT, *+B6[8] 
310,B4 
Al1,A8,A11 
B15,B11,B15 
A5,A0,A8 


B5,*+A6[1] 
*B11++[5],A5 
A7,5,AT7 
Al1,A8,A11 
AO, BO 
A5,A0,A8 


store ps 
load rr[i5][i8 
sq=sq2 
dn[i8] << 16 
decrement OUTERLOOP counter 
smpy (rr[i0] [i18],_1_64) 


load rr[i6][i8 
store ia = i8 
&rr[i8] [i9 
alpl=sadd(alp0,smpy (rr[i8] [i8],_1_128) ) 
psl = sadd(ps0,dn[i8]<<16) 

smpy (rr[il] [i8],_1_64) 


store ps 

load rr[i7] [i8] 

update i8 
alpl=sadd(alpl,smpy (rr[i0] [i8],_1_64) ) 


_1_64 


smpy (rr[i2] [i8],_1_64) 
5.2.6 Implementation of residu.c, the FIR Filter, In GSM EFR 


Example 5-31 shows the FIR filter, residu.c, in GSM EFR. 


Example 5-31. residu.c C Code 


#define Word16 short
#define Word32 int
#define m 10            /* m = LPC order == 10 */

Original C code

void Residu (
    Word16 a[],         /* (i) : prediction coefficients */
    Word16 x[],         /* (i) : speech signal           */
    Word16 y[],         /* (o) : residual signal         */
    Word16 lg           /* (i) : size of filtering       */
)
{
    Word16 i, j;
    Word32 s;

    for (i = 0; i < lg; i++)
    {
        s = L_mult (x[i], a[0]);
        for (j = 1; j <= m; j++)
            s = L_mac (s, a[j], x[i - j]);
        s = L_shl (s, 3);
        y[i] = round (s);
    }
    return;
}

where   L_mult(a,b)  = _smpy(a,b)
        L_mac(a,b,c) = _sadd(a,_smpy(b,c))
        L_shl(a,b)   = (b>0) ? _sshl(a,b) : _shr(a,b)
        round(a)     = _sadd(a,0x8000L) >>16
and lg = 40. 


5.2.6.1 Rearranging the C Code 


L_shl(s,3) can be implemented simply as _sshl(s,3). Since array a has 
dimension m + 1 = 11 and the inner loop is always executed 10 times per outer 
loop iteration, you can completely unroll the inner loop and gain speed by 
keeping array a in registers. Since a is a short integer array, it requires at 
most six registers for full representation. You can dedicate one register to 
a[0] alone for two reasons: 1) a[0] is always a constant, 4096, and 
2) _shr(0x8000L,3) = 4096. Because the rounding constant shifted right by 
three equals a[0], you can swap the order of rounding and left shift and add 
a[0] before the shift, which saves one register. (Otherwise, you need another 
register for 0x8000L.) The C code after complete inner loop unrolling is shown 
in Example 5-32. 
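The reordering of rounding and shifting can be checked on a host machine. The following is a minimal sketch that models the saturating operations in portable C. The names sat32, model_sadd, model_sshl, and model_shr16 are hypothetical stand-ins whose behavior is assumed to match the intrinsics, and the two wrapper functions exist only for this comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit value to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Portable stand-ins for the saturating intrinsics (assumed behavior). */
static int32_t model_sadd(int32_t a, int32_t b)
{
    return sat32((int64_t)a + b);
}

static int32_t model_sshl(int32_t a, unsigned n)
{
    assert(n < 31);
    return sat32((int64_t)a * ((int64_t)1 << n));
}

/* Arithmetic shift right by 16, written without relying on the
   implementation-defined behavior of >> on negative values. */
static int32_t model_shr16(int32_t a)
{
    int32_t q = a / 65536;
    if (a % 65536 < 0) q -= 1;           /* round toward minus infinity */
    return q;
}

/* Original order: shift left by 3, then round with 0x8000. */
static int16_t shift_then_round(int32_t s)
{
    return (int16_t)model_shr16(model_sadd(model_sshl(s, 3), 0x8000));
}

/* Rearranged order: add a[0] = 4096 first, then shift left by 3. */
static int16_t round_then_shift(int32_t s)
{
    return (int16_t)model_shr16(model_sshl(model_sadd(s, 4096), 3));
}
```

In this model the two orders agree both in the ordinary range and near the saturation limits, because 8 * (s + 4096) equals 8 * s + 0x8000 whenever nothing saturates, and both forms clamp to the same 16-bit result when something does.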
Example 5-32. residu.c C Code After Rearrangement Using Intrinsics 


for (i = 0; i < lg; i++)
{
    s = _smpy(x[i], a[0]);
    s = _sadd(s,_smpy(a[1], x[i-1]));
    s = _sadd(s,_smpy(a[2], x[i-2]));
    s = _sadd(s,_smpy(a[3], x[i-3]));
    s = _sadd(s,_smpy(a[4], x[i-4]));
    s = _sadd(s,_smpy(a[5], x[i-5]));
    s = _sadd(s,_smpy(a[6], x[i-6]));
    s = _sadd(s,_smpy(a[7], x[i-7]));
    s = _sadd(s,_smpy(a[8], x[i-8]));
    s = _sadd(s,_smpy(a[9], x[i-9]));
    s = _sadd(s,_smpy(a[10], x[i-10]));
    s = _sadd(s, a[0]);
    s = _sshl(s,3);
    y[i] = _shr(s,16);
}


5.2.6.2 Performance Analysis 
The performance is limited by the .L units for _sadd or by the .M units for 
_smpy, since each of these operations is used at least 11 times per iteration. 
With two units of each type, an iteration takes at least six cycles. You may 
choose to unroll the loop once to compute two ys per iteration for the 
following reasons: 


1) To satisfy the ordering property of _sadd 


2) To maximize speed. Eleven cycles are required to compute two ys, while 
six cycles are needed for one y. 
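The cycle counts above follow from dividing the number of operations by the number of units that can execute them. A minimal sketch of that arithmetic, where the function name min_cycles is hypothetical:

```c
#include <assert.h>

/* Lower bound on cycles per loop iteration imposed by one unit type:
   `uses` operations must share `units` identical functional units. */
static int min_cycles(int uses, int units)
{
    assert(uses >= 0 && units > 0);
    return (uses + units - 1) / units;   /* ceiling division */
}
```

With one .L unit and one .M unit on each of the two data paths, min_cycles(11, 2) gives six cycles for one y, while min_cycles(22, 2) gives eleven cycles for two ys, which is 5.5 cycles per output.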


The C code is shown in Example 5-33. 
Example 5-33. Implemented C Code for residu.c 


for (i = 0; i < lg; i+=2)
{
    s0 = _smpy(x[i], a[0]);
    s1 = _smpy(x[i+1], a[0]);
    s0 = _sadd(s0,_smpy(a[1], x[i-1]));
    s1 = _sadd(s1,_smpy(a[1], x[i]));
    s0 = _sadd(s0,_smpy(a[2], x[i-2]));
    s1 = _sadd(s1,_smpy(a[2], x[i-1]));
    s0 = _sadd(s0,_smpy(a[3], x[i-3]));
    s1 = _sadd(s1,_smpy(a[3], x[i-2]));
    s0 = _sadd(s0,_smpy(a[4], x[i-4]));
    s1 = _sadd(s1,_smpy(a[4], x[i-3]));
    s0 = _sadd(s0,_smpy(a[5], x[i-5]));
    s1 = _sadd(s1,_smpy(a[5], x[i-4]));
    s0 = _sadd(s0,_smpy(a[6], x[i-6]));
    s1 = _sadd(s1,_smpy(a[6], x[i-5]));
    s0 = _sadd(s0,_smpy(a[7], x[i-7]));
    s1 = _sadd(s1,_smpy(a[7], x[i-6]));
    s0 = _sadd(s0,_smpy(a[8], x[i-8]));
    s1 = _sadd(s1,_smpy(a[8], x[i-7]));
    s0 = _sadd(s0,_smpy(a[9], x[i-9]));
    s1 = _sadd(s1,_smpy(a[9], x[i-8]));
    s0 = _sadd(s0,_smpy(a[10], x[i-10]));
    s1 = _sadd(s1,_smpy(a[10], x[i-9]));
    s0 = _sadd(s0, a[0]);
    s1 = _sadd(s1, a[0]);
    s0 = _sshl(s0,3);
    s1 = _sshl(s1,3);
    y[i] = _shr(s0,16);
    y[i+1] = _shr(s1,16);
}


5.2.6.3 Final Assembly Code for residu.c 


The final assembly code is shown in Example 5-34. 
Example 5-34. residu.c Final Assembly Code 


K* 
bead Implementation 
K* 
ee Compute two ys 
K* 
ae Total cycles = 
K* 
K* 
K* 


K* 


LDH D2 
LDW D1 
| | LDW D2 
LDW D1 
| | LDW D2 
LDW D2 
LDW D2 
| | LDW D1 
LDW .D2 
| | MVK eS 
| | MV .L1xX 
| | MV .S2 
LOOP 
SMPY -M1 
SMP YHL 2X 
LDW D1 
[!A2] SADD Ll 
[!A2] SADD 2h2 
SMPYHL .M1X 
SMPY .M2X 
[!A2] SADD Ll 
[!A2] SADD L2 


= 237 
Register Usage: 


of residu.c EFR 


at a time 


(lg/2+1) *11+6 
(for lg 


, 
’ 


’ 


+ 


B4++,BO 


*A4--, B3 
B4++,B4 


+ 


*A4--, Al 
B4++,B1 


+ 


+ 


B4++,B5 


*B4++,B6 
*A4—--,A3 


*B4++,B7 
1,A2 
BO, AO 
B6,B2 


A3,A0,A8 
A3,B0,B8 
*A4--, Al 
A8,A9,A9 
B8,B9,B9 


Al,B4,A8 
A3,B4,B8 
A8,A9,A9 
B8,B9,B9 


B 
10 


&a[0] 

&x [0] 

&y [0] 

lg 

load a[0O] = 4 
load x[0] & x 
load a[l] & a 
load x[-2] & 
load a[3] &a 
load a[5] & a 
load a[7] & a 
load x[-4] & 
load a[9] & a 
to take care 
a[0O] = 4096 


loop counter, 


smpy(x[0],a 
smpy(x[l],a 
load x[-6] & 
sO = sadd(s0, 
sl = sadd(sl, 


smpy (x[-1l],al 
smpy (x[0],a[1 
sO = sadd(s0, 
sl sadd(sl, 


10] 
of the first execution 


L_SUBFR/2 


1)) 

]) 

smpy (x 
smpy (x 


KAEKKKKKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KK KKK KKK KKK KK KKK KKK KKK KK 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


K* 


KAEKKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KK KKK KKK KK KKK KKK KKK KKK KK 


         SMPYLH  .M1X  A1,B4,A8     ; smpy(x[-2],a[2])
||       SMPYH   .M2X  A1,B4,B8     ; smpy(x[-1],a[2])
||       ADD     .S1   A8,0,A9      ; s0=smpy(x[0],a[0])
||       ADD     .S2   B8,0,B9      ; s1=smpy(x[1],a[0])
||       LDW     .D1   *A4--,A3     ; load x[-8] & x[-7]
|| [!A2] SADD    .L1   A9,A0,A9     ; s0 = sadd(s0, 4096)
|| [!A2] SADD    .L2   B9,B0,B9     ; s1 = sadd(s1, 4096)

         SMPYHL  .M1X  A3,B1,A8     ; smpy(x[-3],a[3])
||       SMPY    .M2X  A1,B1,B8     ; smpy(x[-2],a[3])
||       SADD    .L1   A8,A9,A9     ; s0 = sadd(s0, smpy(x[-1],a[1]))
||       SADD    .L2   B8,B9,B9     ; s1 = sadd(s1, smpy(x[0],a[1]))
|| [!A2] SSHL    .S1   A9,3,A7      ; s0 = L_shl(s0,3)
|| [!A2] SSHL    .S2   B9,3,B10     ; s1 = L_shl(s1,3)

         SMPYLH  .M1X  A3,B1,A8     ; smpy(x[-4],a[4])
||       SMPYH   .M2X  A3,B1,B8     ; smpy(x[-3],a[4])
||       SADD    .L1   A8,A9,A9     ; s0 = sadd(s0, smpy(x[-2],a[2]))
||       SADD    .L2   B8,B9,B9     ; s1 = sadd(s1, smpy(x[-1],a[2]))
||       LDW     .D1   *A4++[6],A1  ; load x[-10] & x[-9] and update the pointer
                                    ; to the new &x[0]
|| [!A2] SHR     .S1   A7,16,A7     ; y[0] = shr(s0, 16)
|| [!A2] SHR     .S2   B10,16,B10   ; y[1] = shr(s1, 16)

         SMPYHL  .M1X  A1,B5,A8     ; smpy(x[-5],a[5])
||       SMPY    .M2X  A3,B5,B8     ; smpy(x[-4],a[5])
||       SADD    .L1   A8,A9,A9     ; s0 = sadd(s0, smpy(x[-3],a[3]))
||       SADD    .L2   B8,B9,B9     ; s1 = sadd(s1, smpy(x[-2],a[3]))
|| [!A2] STH     .D1   A7,*A6++     ; store y[0]
|| [B2]  SUB     .S2   B2,2,B2      ; decrement loop counter
|| [B2]  B       .S1   LOOP         ; branch to the loop

         SMPYLH  .M1X  A1,B5,A8     ; smpy(x[-6],a[6])
||       SMPYH   .M2X  A1,B5,B8     ; smpy(x[-5],a[6])
||       SADD    .L1   A8,A9,A9     ; s0 = sadd(s0, smpy(x[-4],a[4]))
||       SADD    .L2   B8,B9,B9     ; s1 = sadd(s1, smpy(x[-3],a[4]))
||       LDW     .D1   *A4--,A3     ;* load x[0] & x[1] for the next iteration

         SMPYHL  .M1X  A3,B6,A8     ; smpy(x[-7],a[7])
||       SMPY    .M2X  A1,B6,B8     ; smpy(x[-6],a[7])
||       SADD    .L1   A8,A9,A9     ; s0 = sadd(s0, smpy(x[-5],a[5]))
||       SADD    .L2   B8,B9,B9     ; s1 = sadd(s1, smpy(x[-4],a[5]))
||       LDW     .D1   *A4--,A1     ;* load x[-1] & x[-2]


         SMPYLH  .M1X  A3,B6,A8     ; smpy(x[-8],a[8])
||       SMPYH   .M2X  A3,B6,B8     ; smpy(x[-7],a[8])
||       SADD    .L1   A8,A9,A9     ; s0 = sadd(s0, smpy(x[-6],a[6]))
||       SADD    .L2   B8,B9,B9     ; s1 = sadd(s1, smpy(x[-5],a[6]))
|| [!A2] STH     .D1   B10,*A6++    ; store y[1]

         SMPYHL  .M1X  A1,B7,A8     ; smpy(x[-9],a[9])
||       SMPY    .M2X  A3,B7,B8     ; smpy(x[-8],a[9])
||       SADD    .L1   A8,A9,A9     ; s0 = sadd(s0, smpy(x[-7],a[7]))
||       SADD    .L2   B8,B9,B9     ; s1 = sadd(s1, smpy(x[-6],a[7]))
|| [A2]  SUB           A2,1,A2
||       LDW     .D1   *A4--,A3     ;* load x[-3] & x[-4]

         SMPYLH  .M1X  A1,B7,A8     ; smpy(x[-10],a[10])
||       SMPYH   .M2X  A1,B7,B8     ; smpy(x[-9],a[10])
||       SADD    .L1   A8,A9,A9     ; s0 = sadd(s0, smpy(x[-8],a[8]))
||       SADD    .L2   B8,B9,B9     ; s1 = sadd(s1, smpy(x[-7],a[8]))


There is no memory bank hit within the loop. To avoid a memory bank hit within 
the prelog of the loop, arrays a and x must be allocated so that the relative 
offset between a[1] and x[0] places them in different memory banks. Some of 
the instructions in the loop must not be executed in the first iteration. 
These are the instructions conditioned on register A2. 


5.2.7 Implementation of the Lag Search In the Routine lag_max() 


Routine Lag_max() performs an open-loop pitch search and computes the 
normalized correlation for the selected lag. This subsection illustrates the 
implementation of the lag search. The lag search C code is shown in 
Example 5-35. 


Example 5-35. Lag Search C Code for lag_max() 


#define Word16  short
#define Word32  int
#define MIN_32  0x80000000L
#define PIT_MAX 143
#define L_FRAME 160

input:
    Word16 scal_sig[PIT_MAX+L_FRAME];  (pointed at scal_sig[PIT_MAX] when passed)
    Word16 scal_fac;                   (not used in this part of the code)
    Word16 L_frame, lag_min, lag_max;
local variables:
    Word16 i, j, *p, *p1, p_max;
    Word32 t0, max;
return:
    Word16 p_max;

Original C code

    max = MIN_32;
    for (i = lag_max; i >= lag_min; i--)
    {
        p = scal_sig;
        p1 = &scal_sig[-i];
        t0 = 0;
        for (j = 0; j < L_frame; j++, p++, p1++)
        {
            t0 = L_mac (t0, *p, *p1);
        }
        if (L_sub (t0, max) >= 0)
        {
            max = t0;
            p_max = i;
        }
    }

where   L_mac(a,b,c) = _sadd(a,_smpy(b,c))
        L_sub(a,b)   = _ssub(a,b)
        L_frame      = L_FRAME/2 = 80
and the search range (lag_min, lag_max) is (18,35), (36,71), or (72,143). 
5.2.7.1 Rearranging The C Code and Unrolling The Loops 


This algorithm favors smaller lag candidates: the search starts from lag_max 
and the comparison is if(L_sub(t0,max) >= 0), so when two lags tie, the one 
visited later (the smaller lag) replaces the earlier one. Since there is no 
single instruction for a >= (or <=) comparison, you must change the search 
order to start from lag_min and compare with if(t0 > max). p_max is 
initialized to lag_min to handle the extreme case in which every t0 within 
the search range equals MIN_32. The C code is modified as shown in 
Example 5-36. 
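The two search orders can be compared with a small host-side sketch. It assumes that the saturating L_sub comparison behaves like a plain t0 >= max test, and it works on an array of precomputed correlation values instead of computing them. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Original order: scan from the largest index down, accept ties (>=). */
static int argmax_descending_ge(const int32_t *t, int lo, int hi)
{
    int32_t max = INT32_MIN;
    int best = lo;
    for (int i = hi; i >= lo; i--) {
        if (t[i] >= max) { max = t[i]; best = i; }
    }
    return best;
}

/* Rearranged order: scan from the smallest index up, strict compare (>). */
static int argmax_ascending_gt(const int32_t *t, int lo, int hi)
{
    int32_t max = INT32_MIN;
    int best = lo;      /* covers the case where every t[i] == INT32_MIN */
    for (int i = lo; i <= hi; i++) {
        if (t[i] > max) { max = t[i]; best = i; }
    }
    return best;
}
```

Both functions return the smallest index that holds the maximum value, including the case where every entry equals INT32_MIN, which is why the initialization of the result to the lower bound matters in the ascending version.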


Example 5-36. C Code With the Comparison Order Changed 


max = MIN_32;
p_max = lag_min;
for (i = lag_min; i <= lag_max; i++)
{
    p = scal_sig;
    p1 = &scal_sig[-i];
    t0 = 0;
    for (j=0; j<L_frame; j++, p++, p1++) {
        t0 = L_mac(t0, *p, *p1);
    }
    if (t0 > max)
    {
        max = t0;
        p_max = i;
    }
}


Next, look at the inner loop, a general MAC loop. Since *p does not always 
equal *p1, it does not fall into the special case shown in subsection 5.2.1 
on page 5-3. Therefore, the performance cannot be improved by simply 
unrolling the inner loop. 


Now consider unrolling the outer loop once. The C code with outer loop 
unrolling is shown in Example 5-37. Notice that since the number of lags that 
needs to be searched within each search range is always even, such unrolling 
will not create an additional case to handle. 
Example 5-37. C Code With Outer Loop Unrolling 


Word32 t1;

max = MIN_32;
p_max = lag_min;
for (i = lag_min; i < lag_max; i+=2)
{
    p = scal_sig;
    p1 = &scal_sig[-i];
    t0 = 0;
    t1 = 0;
    for (j=0; j<L_frame; j++, p++, p1++)
    {
        t1=_sadd(t1,_smpy(*p,*-p1));    (or t1=_sadd(t1,_smpy(scal_sig[j],scal_sig[-i-1+j])))
        t0=_sadd(t0,_smpy(*p,*p1));     (or t0=_sadd(t0,_smpy(scal_sig[j],scal_sig[-i+j])))
    }
    if (t0 > max)
    {
        max = t0;
        p_max = i;
    }
    if (t1 > max)
    {
        max = t1;
        p_max = i+1;
    }
}

with intrinsics substituted. 


Notice that in the order of the comparisons, the smaller lag is always compared 
first. 


The instructions required for one iteration of the inner loop are shown in 


Example 5-38. 


Example 5-38. Inner Loop Instructions 


INNERLOOP:
         LDH   .D   *p++,sigj          ; load scal_sig[j]
         LDH   .D   *-p1,scalij1       ; load scal_sig[-i-1+j]
         SMPY  .M   sigj,scalij1,tmp1  ; smpy(scal_sig[j],scal_sig[-i-1+j])
         SADD  .L   t1,tmp1,t1         ; t1=sadd(t1,smpy(scal_sig[j],scal_sig[-i-1+j]))
         LDH   .D   *p1++,scalij       ; load scal_sig[-i+j]
         SMPY  .M   sigj,scalij,tmp0   ; smpy(scal_sig[j],scal_sig[-i+j])
         SADD  .L   t0,tmp0,t0         ; t0=sadd(t0,smpy(scal_sig[j],scal_sig[-i+j]))
[icntr]  SUB   .S   icntr,1,icntr      ; decrement inner loop counter
[icntr]  B     .S   INNERLOOP          ; branch to inner loop
The .D unit is the most heavily used unit (three loads per iteration). Since 
there are only two .D units, the inner loop takes two cycles. 


Now unroll the inner loop once. Notice that the first iteration of t1 and the 
last iteration of t0 are performed outside the inner loop. This avoids memory 
bank hits. The C code with the inner loop unrolled is shown in Example 5-39. 


Example 5-39. Search Code With Inner and Outer Loops Unrolled 


Word32 t1;

max = MIN_32;
p_max = lag_min;
for (i = lag_min; i < lag_max; i+=2)
{
    p = scal_sig;
    p1 = &scal_sig[-i];
    t0 = 0;
    t1=_sadd(t1,_smpy(*p,*-p1));          (or t1=_sadd(t1,_smpy(scal_sig[j],scal_sig[-i-1+j])))
    for (j=0; j<(L_frame-1); j+=2, p+=2, p1+=2) {
        t0=_sadd(t0,_smpy(*p,*p1));       (or t0=_sadd(t0,_smpy(scal_sig[j],scal_sig[-i+j])))
        t1=_sadd(t1,_smpy(*+p,*p1));      (or t1=_sadd(t1,_smpy(scal_sig[j+1],scal_sig[-i+j])))
        t0=_sadd(t0,_smpy(*+p,*+p1));     (or t0=_sadd(t0,_smpy(scal_sig[j+1],scal_sig[-i+j+1])))
        t1=_sadd(t1,_smpy(*+p[2],*+p1));  (or t1=_sadd(t1,_smpy(scal_sig[j+2],scal_sig[-i+j+1])))
    }
    t0=_sadd(t0,_smpy(scal_sig[L_frame-1],scal_sig[-i+L_frame-1]));
    if (t0 > max)
        max = t0;
        p_max = i;


Although five values of scal_sig (scal_sig[j], scal_sig[j+1], scal_sig[j+2], 
scal_sig[-i+j], and scal_sig[-i+j+1]) are required for each inner loop 
iteration, scal_sig[j] does not need to be loaded, since it was loaded as 
scal_sig[j+2] in the previous iteration. This means only four loads are 
required per iteration. Example 5-40 gives the instructions for the modified 
inner loop. 
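The reuse of a loaded sample across iterations can be illustrated on a host machine. The sketch below is one possible arrangement, not a transcription of the example code: the loop bounds and the tail handling are chosen here so that every term is accumulated exactly once and in the original order. The names sat32, model_smpy, model_sadd, corr_reference, and corr_pair_unrolled are hypothetical, and the two model functions are assumed to behave like _smpy and _sadd.

```c
#include <assert.h>
#include <stdint.h>

static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Portable stand-ins for _smpy and _sadd (assumed behavior). */
static int32_t model_smpy(int16_t a, int16_t b)
{
    return sat32(2 * (int64_t)a * b);
}

static int32_t model_sadd(int32_t a, int32_t b)
{
    return sat32((int64_t)a + b);
}

/* Straightforward correlation for one lag. sig points at the current
   frame, so sig[j - lag] reaches back into the signal history. */
static int32_t corr_reference(const int16_t *sig, int lag, int n)
{
    int32_t t = 0;
    for (int j = 0; j < n; j++)
        t = model_sadd(t, model_smpy(sig[j], sig[j - lag]));
    return t;
}

/* Lags i and i+1 together with the inner loop unrolled once (n even).
   Each pass loads four half-words; a0 is carried from the previous pass. */
static void corr_pair_unrolled(const int16_t *sig, int i, int n,
                               int32_t *t0_out, int32_t *t1_out)
{
    const int16_t *b = sig - i;              /* b[j] == sig[j - i] */
    int16_t a0 = sig[0];
    int32_t t0 = 0;
    int32_t t1 = model_smpy(a0, b[-1]);      /* first t1 term, peeled */
    int j;

    assert(n >= 2 && n % 2 == 0);
    for (j = 0; j < n - 2; j += 2) {
        int16_t b0 = b[j],     a1 = sig[j + 1];   /* loaded together */
        int16_t b1 = b[j + 1], a2 = sig[j + 2];   /* loaded together */
        t0 = model_sadd(t0, model_smpy(a0, b0));
        t1 = model_sadd(t1, model_smpy(a1, b0));
        t0 = model_sadd(t0, model_smpy(a1, b1));
        t1 = model_sadd(t1, model_smpy(a2, b1));
        a0 = a2;                                  /* reused next pass */
    }
    /* Tail (j == n - 2): the final t0 term falls outside the loop. */
    t0 = model_sadd(t0, model_smpy(a0, b[j]));
    t1 = model_sadd(t1, model_smpy(sig[j + 1], b[j]));
    t0 = model_sadd(t0, model_smpy(sig[j + 1], b[j + 1]));
    *t0_out = t0;
    *t1_out = t1;
}
```

Because each accumulator still receives its terms in increasing j order, the saturating sums match the straightforward loop exactly, which is the property that must hold before the loads are rearranged.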
Example 5-40. Inner Loop Instructions 


         LDH   .D   *p++,sigj               ; load scal_sig[j]
         LDH   .D   *-p1,scalij1            ; load scal_sig[-i-1+j]
         SMPY  .M   sigj,scalij1,t1         ; t1=smpy(scal_sig[j],scal_sig[-i-1+j])
INNERLOOP:
         LDH   .D   *p1++,scalij            ; load scal_sig[-i+j]
         SMPY  .M   sigj,scalij,tmp0        ; smpy(scal_sig[j],scal_sig[-i+j])
         SADD  .L   t0,tmp0,t0              ; t0=sadd(t0,smpy(scal_sig[j],scal_sig[-i+j]))
         LDH   .D   *p++,sigj+1             ; load scal_sig[j+1]
         SMPY  .M   sigj+1,scalij,tmp1      ; smpy(scal_sig[j+1],scal_sig[-i+j])
         SADD  .L   t1,tmp1,t1              ; t1=sadd(t1,smpy(scal_sig[j+1],scal_sig[-i+j]))
         LDH   .D   *p1++,scalij+1          ; load scal_sig[-i+j+1]
         SMPY  .M   sigj+1,scalij+1,tmp0    ; smpy(scal_sig[j+1],scal_sig[-i+j+1])
         SADD  .L   t0,tmp0,t0              ; t0=sadd(t0,smpy(scal_sig[j+1],scal_sig[-i+j+1]))
         LDH   .D   *p++,sigj+2             ; load scal_sig[j+2], the scal_sig[j] for the
                                            ; next iteration
         SMPY  .M   sigj+2,scalij+1,tmp1    ; smpy(scal_sig[j+2],scal_sig[-i+j+1])
         SADD  .L   t1,tmp1,t1              ; t1=sadd(t1,smpy(scal_sig[j+2],scal_sig[-i+j+1]))
[icntr]  SUB   .S   icntr,2,icntr           ; decrement inner loop counter
[icntr]  B     .S   INNERLOOP               ; branch to inner loop


The inner loop uses two cycles. You double the performance, therefore, by 
unrolling both the outer loop and inner loop if no memory bank hits occur. 


5.2.7.2 Avoiding Memory Bank Hits 


Load scal_sig[-i+j] and scal_sig[j+1] together, and load scal_sig[-i+j+1] and 
scal_sig[j+2] together, to avoid memory bank hits. 
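The choice of load pairs can be examined with a small bank model. The sketch below assumes four interleaved banks, each one half-word wide, so that two half-word loads from the same array collide when their indices differ by a multiple of four. The bank count and the function names are assumptions made for illustration.

```c
#include <assert.h>

#define NUM_BANKS 4   /* assumed: four banks, each one half-word wide */

/* Bank of element `index` of a half-word array whose element 0 sits in
   bank `base_bank`; index may be negative. */
static int halfword_bank(int base_bank, int index)
{
    int b = (base_bank + index) % NUM_BANKS;
    return b < 0 ? b + NUM_BANKS : b;
}

/* Nonzero if loading elements x and y in one cycle hits the same bank. */
static int same_bank(int base_bank, int x, int y)
{
    return halfword_bank(base_bank, x) == halfword_bank(base_bank, y);
}

/* Count collisions over n inner loop steps for lag i when scal_sig[-i+j]
   is loaded in the same cycle as scal_sig[j+offset]. */
static int pair_hits(int base_bank, int i, int n, int offset)
{
    int hits = 0;
    assert(base_bank >= 0 && base_bank < NUM_BANKS);
    for (int j = 0; j < n; j++)
        hits += same_bank(base_bank, j - i, j + offset);
    return hits;
}
```

The recommended pairs differ by i+1 half-words. Every lag handled as i in the unrolled outer loop is even, because each lag_min is even and the loop steps by two, so i+1 is odd and pair_hits(base, i, n, 1) is zero in this model. Pairing scal_sig[-i+j] with scal_sig[j] instead (offset 0) would collide on every load for lags such as 36 and 72.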


5.2.7.3 The Final Assembly Code for Lag Search 


The final assembly code for the lag search segment is shown in 
Example 5-41. 
Example 5-41. Final Assembly Code for Lag Search in Lag_max() 


***************************************************************************
*
*  Implementation of residu.c EFR
*
*  Compare two lags a time
*
*  Total cycles = 7+(L_frame+6)*(lag_max-lag_min+1)/2
*
*  Register Usage:   A    B
*                    10   9
*
***************************************************************************
; A4 --- &scal_sig
; A6 --- lag_max
; B6 --- lag_min
         SUBAH  .D1   A4,A6,A7      ; p1=&scal_sig[-LAG_MIN]
||       MVK    .S2   1,B2
||       SUB    .L1X  B6,A6,A1      ; the outer loop counter
||       MV     .L2X  A4,B7         ; p=&scal_sig[0]
||       MPY    .M2   B0,0,B0       ; initialize the comparison result
||       MPY    .M1   A2,0,A2       ; take care of the initial iteration
||       MV     .S1   A6,A4         ; p_max = lag_min

         SHL    .S2   B2,31,B2      ; max=MIN_32=0x80000000L
||       LDH    .D1   *-A7[1],A5    ; scal_sig[-LAG_MIN-1]
||       LDH    .D2   *B7,B5        ; scal_sig[0]
||       ADD    .L1   A1,1,A1       ; make the counter to be an even number

OUTERLOOP:
         LDH    .D1   *A7,A5        ; scal_sig[-LAG_MIN]
||       LDH    .D2   *+B7[1],B6    ; scal_sig[1]
|| [A2]  SADD   .L2   B10,B8,B10
|| [A1]  MVK    .S2   37,B1         ; inner loop counter
||       MPY    .M1   A3,0,A3
||       MPY    .M2   B8,0,B8
||       ADD    .S1   A7,2,A9       ; &scal_sig[-LAG_MIN+1]
||       SUB    .L1   A7,4,A7       ; update p1 = &scal_sig[-LAG_MIN-2]

         LDH    .D1   *A9++,A5      ; scal_sig[-LAG_MIN+1]
||       LDH    .D2   *+B7[2],B5    ; scal_sig[2]
|| [B1]  B            INNERLOOP     ; branch to the inner loop
|| [A2]  CMPGT  .L2   B10,B2,B0     ; if (t0>max)

         LDH    .D1   *A9++,A5      ; scal_sig[-LAG_MIN+2]
||       LDH    .D2   *+B7[3],B6    ; scal_sig[3]
|| [B0]  MV     .L2   B10,B2        ; max = t0
||       MPY    .M1X  B1,1,A2       ; counter to branch to the outerloop




[!A2] 


[!A1] 


FINISH: 


NO 


*A9++,A5 
*+B7[4],B5 
INNERLOOP 
A0,B2,B0 
A6,2,A4 
A6,2,A6 
AO,0,A0 
B10,0,B10 


*A9++,A5 
*4+B7([5],B6 
A5,B5,A3 
AO,B2 
A6,3,A4 
Al,2,Al 
B7,12,B9 


*A9++,A5 
*B9++4+,B5 
A5,B6,A3 
A5,B5,B8 
AO, A3,A0 
B10,B8,B10 
INNERLOOP 
Bl,1,Bl 


*A9++,A5 
*BO++,B6 
A5,B5,A3 
A5,B6,B8 
AO,A3,A0 
B10,B8,B10 
A2,1,A2 
OUTERLOOP 


*-AT7[1],A5 
*B7,B5 
AO0,A3,A0 
B10,B8,B10 
FINISH 


scal_sig[-LAG_MIN+3] 
scal_sig[4] 

branch to the inner loop 
if (t1>max) 

p_max = i 

update i 

initialize t1=0 
initialize t0=0 


scal_sig[-LAG_MIN+4] 
scal_sig[5] 

_smpy (scal_sig[-LAG_MIN-1], 
max = tl 

p_max = itl 

update inner loop counter 
&scal_sig[1] 


scal_sig[0]) 


scal_sig[-LAG_MI 
scal_sig[6] 
_smpy (scal_sig[-LAG_MIN], 
_smpy (scal_sig[-LAG_MIN], 
update tl 

update t0 

branch to inner loop 
decrement inner loop counter 


+55] 


scal_sig[1]) 
scal_sig[0]) 


scal_sig[-LAG_MI 
scal_sig[7] 
_smpy (scal_sig[-LAG_MIN+1], 
_smpy (scal_sig[-LAG_MIN+1], 
update tl 

update t0 

decrement the counter to branch to the outer loop 
branch to the outer loop 


+6] 


scal_sig[2]) 
scal_sig[1]) 


scal_sig[-LAG_MIN-3] 
scal_sig[0] 

update tl 

update t0 

lag search is complete 


Notice that all the epilogs and prelogs of the outer and inner loops are com- 
pressed to minimize the code size. A2 is both the indicator for avoiding 
comparisons during the initial iteration of the outer loop and the counter for 
branching to the outer loop during inner loop executions. 
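The cycle count stated in the header of Example 5-41 can be evaluated for the three search ranges. This is a minimal sketch of that formula only; the function name lag_search_cycles is hypothetical.

```c
#include <assert.h>

/* Cycle count from the header of the lag search listing:
   7 + (L_frame + 6) * (lag_max - lag_min + 1) / 2 */
static int lag_search_cycles(int l_frame, int lag_min, int lag_max)
{
    int lags = lag_max - lag_min + 1;
    assert(lags > 0 && lags % 2 == 0);   /* two lags per outer iteration */
    return 7 + (l_frame + 6) * (lags / 2);
}
```

With L_frame = 80, the formula gives 781 cycles for the range (18,35), 1555 cycles for (36,71), and 3103 cycles for (72,143).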